Abstract Visual textures have played a key role in image understanding because they convey important semantics of images, and because texture representations that pool local image descriptors in an orderless manner have had a tremendous impact in diverse applications. In this paper we make several contributions to texture understanding. First, instead of focusing on texture instance and material category recognition, we propose a human-interpretable vocabulary of texture attributes to describe common texture patterns, complemented by a new describable texture dataset for benchmarking. Second, we look at the problem of recognizing materials and texture attributes in realistic imaging conditions, including when textures appear in clutter, developing corresponding benchmarks on top of the recently proposed OpenSurfaces dataset. Third, we revisit classic texture representations, including bag-of-visual-words and the Fisher vectors, in the context of deep learning and show that these have excellent efficiency and generalization properties if the convolutional layers of a deep model are used as filter banks. We obtain in this manner state-of-the-art performance in numerous datasets well beyond textures, an efficient method to apply deep features to image regions, as well as benefit in transferring features from one domain to another.

1 Introduction

Visual representations based on orderless aggregations of local features, which were originally developed as texture descriptors, have had a widespread influence in image understanding. These models include cornerstones such as the histograms of vector quantized filter responses of Leung and Malik [?] and later generalizations such as the bag-of-visual-words model of Csurka et al. [26] and the Fisher vector of Perronnin et al. [?]. These and other texture models have been successfully applied to a huge variety of visual domains, including problems closer to “texture understanding” such as material recognition, as well as domains such as object categorization and face identification that share little of the appearance of textures.

This paper makes three contributions to texture understanding. The first one is to add a new semantic dimension to the problem. We depart from most of the previous works on visual textures, which focused on texture identification and material recognition, and look instead at the problem of describing generic texture patterns. We do so by developing a vocabulary of forty-seven texture attributes that describe a wide range of texture patterns; we also introduce a large dataset annotated with these attributes, which we call the describable texture dataset (Sect. 2). We then study whether texture attributes can be reliably estimated from images, and for what tasks they are useful. We demonstrate in particular two applications (Sect. 8.1): the first one is to use texture attributes as dimensions to organise large collections of texture patterns, such as textiles, wallpapers, and construction materials, for search and retrieval. The second one is to use texture attributes as a compact basis of visual descriptors applicable to other tasks such as material recognition.

The second contribution of the paper is to introduce new data and benchmarks to study texture recognition in realistic settings. While most of the earlier work on texture recognition was carried out in carefully controlled conditions, more recent benchmarks such as the Flickr material dataset (FMD) [87] have emphasized the importance of testing algorithms “in the wild”, for example on Internet images. However, even these datasets are somewhat removed from practical applications as they assume that textures fill the field of view, whereas in applications they are often observed in clutter. Here we leverage the excellent OpenSurfaces dataset [8] to create novel benchmarks for materials and texture attributes where textures appear both in the wild and in clutter (Sect. 3), and demonstrate promising recognition results in these challenging conditions. In [10] the same authors have also investigated material recognition using OpenSurfaces.

The third contribution is technical and revisits classical ideas in texture modeling in the light of modern local feature descriptors and pooling encoders. While texture representations were extensively used in most areas of image understanding, since the breakthrough work of [?] they have been replaced by deep Convolutional Neural Networks (CNNs). Often CNNs are applied to a problem by using transfer learning, in the sense that the network is first trained on a large-scale image classification task such as the ImageNet ILSVRC challenge [?], and then applied to another domain by exposing the output of a so-called “fully connected layer” as a general-purpose image representation. In this work we illustrate the many benefits of truncating these CNNs earlier, at the level of the convolutional layers (Sect. 4). In this manner, one obtains powerful local image descriptors that, combined with traditional pooling encoders developed for texture representations, result in state-of-the-art recognition accuracy in a diverse set of visual domains, from material and texture attribute recognition, to coarse and fine grained object categorization and scene classification. We show that a benefit of this approach is that features transfer easily across domains even without fine-tuning the CNN on the target problem. Furthermore, pooling allows us to efficiently evaluate descriptors in image subregions, a fact that we exploit to recognize local image regions without recomputing CNN features from scratch.


2 Describing textures with attributes 



[Figure: example texture images, each labelled with a combination of attributes drawn from words such as porous, dotted, freckled, braided, interlaced, scaly, crosshatched, flecked, honeycombed, knitted, woven, zigzagged, studded, waffled, wrinkled, crystalline, cracked, fibrous, smeared, pitted, and swirly.]

Fig. 1: We address the problem of describing textures by associating to them a collection of attributes. Our goal is to understand and generate automatically human-interpretable descriptions such as the examples above.


This section looks at the problem of automatically describing texture patterns using a general-purpose vocabulary of human-interpretable texture attributes, in a manner similar to how we can vividly characterize the textures shown in Fig. 1. The goal is to design algorithms capable of generating and understanding texture descriptions involving a combination of describable attributes for each texture. Visual attributes have been extensively used in search, to understand complex user queries, in learning, to port textual information back to the visual domain, and in image description, to produce richer accounts of the content of images. Textural properties are an important component of the semantics of images, particularly for objects that are best characterized by a pattern, such as a scarf or the wings of a butterfly [?]. Nevertheless, the attributes of visual textures have been investigated only tangentially so far. Our aim is to fill this gap.


A symmetric approach, using SIFT as local features and the IFV followed by fully-connected layers from a deep neural network as a pooling mechanism, was proposed in [?], obtaining similar results on VOC07.

This paper is the archival version of two previous publications [23] and [24]. Compared to these two papers, this new version adds a significant number of new experiments and a substantial amount of new discussion.

The code and data for this paper are available on the project page, at http://www.robots.ox.ac.uk/~vgg/research/deeptex


Our first contribution is to introduce the Describable Textures Dataset (DTD) [23], a collection of real-world texture images annotated with one or more adjectives selected in a vocabulary of forty-seven English words. These adjectives, or describable texture attributes, are illustrated in Fig. 2 and include words such as banded, cobwebbed, freckled, knitted, and zigzagged. Sect. 2.1 describes this data in more detail. Sect. 2.2 discusses the technical challenges we addressed while designing and collecting DTD, including how the forty-seven texture attributes were selected and how the problem of collecting numerous attributes for a vast number of images was addressed. Sect. 2.3 defines a number of benchmark tasks in DTD. Finally, Sect. 2.5 relates DTD to existing texture datasets.


2.1 The Describable Texture Dataset 

DTD investigates the problem of texture description, un¬ 
derstood as the recognition of describable texture attributes. 
This problem is complementary to standard texture analysis 
tasks such as texture identification and material recognition 
for the following reasons. While describable attributes are 
correlated with materials, attributes do not imply materials 
(e.g. veined may equally apply to leaves or marble) and ma¬ 
terials do not imply attributes (not all marbles are veined). 
This distinction is further elaborated in Sect. 2.4.

Describable attributes can be combined to create rich 
descriptions (Fig. marble can be veined, stratified and 
cracked at the same time), whereas a typical assumption is 
that textures are made of a single material. Describable at¬ 
tributes are subjective properties that depend on the imaged 
object as well as on human judgements, whereas materials 
are objective. In short, attributes capture properties of tex¬ 
tures complementary to materials, supporting human-centric 
tasks where describing textures is important. At the same 
time, we will show that texture attributes are also helpful in 
material recognition (Sect. 8.1).

DTD contains textures in the wild, i.e. texture images extracted from the web rather than captured or generated in a controlled setting. Textures fill the entire image in order to allow studying the problem of texture description independently of texture segmentation, which is instead addressed in Sect. 3. With 5,640 annotated texture images, this dataset aims at supporting real-world applications where the recognition of texture properties is a key component. Collecting images from the Internet is a common approach in categorization and object recognition, and was adopted in material recognition in FMD. This choice trades off the systematic sampling of illumination and viewpoint variations existing in datasets such as CUReT, KTH-TIPS, Outex, and Drexel to capture real-world variations, reducing the gap with applications. Furthermore, DTD captures empirically human judgements regarding the invariance of describable texture attributes; this invariance is not necessarily reflected in material properties.


2.2 Dataset design and collection 

This section discusses how DTD was designed and collected, including: selecting the 47 attributes, finding at least 120 representative images for each attribute, and collecting all the attribute labels for each image in the dataset.


2.2.1 Selecting the describable attributes 

Psychological experiments suggest that, while there are a few hundred words that people commonly use to describe textures, this vocabulary is redundant and can be reduced to a much smaller number of representative words. Our starting point is the list of 98 words identified by Bhushan et al. [?]. Their seminal work aimed to achieve for texture recognition the same that color words have achieved for describing color spaces [?]. However, their work mainly focuses on the cognitive aspects of texture perception, including perceptual similarity and the identification of directions of perceptual texture variability. Since our interest is in the visual aspects of texture, words such as “corrugated” that are more related to surface shape or haptic properties were ignored. Other words such as “messy” that are highly subjective and do not necessarily correspond to well defined visual features were also ignored. After this screening phase we analyzed the remaining words and merged similar ones such as “coiled”, “spiraled” and “corkscrewed” into a single term. This resulted in a set of 47 words, illustrated in Fig. 2.

2.2.2 Bootstrapping the key images 

Given the 47 attributes, the next step consisted in collecting a sufficient number (120) of example images representative of each attribute. First, a large initial pool of about a hundred thousand images in total was downloaded from Google and Flickr by entering the attributes and related terms as search queries. Then Amazon Mechanical Turk (AMT) was used to remove low-resolution, poor-quality, or watermarked images, as well as images that were not almost entirely filled with a texture. Next, detailed annotation instructions were created for each of the 47 attributes, including a dictionary definition of each concept and examples of textures that did and did not match the concept. Votes from three AMT annotators were collected for the candidate images of each attribute, and a shortlist of about 200 highly-voted images was further manually checked by the authors to eliminate remaining errors. The result was a selection of 120 key representative images for each attribute.

2.2.3 Sequential joint annotation 

So far only the key attribute of each image is known, while any of the remaining 46 attributes may apply as well. Exhaustively collecting annotations for 46 attributes and 5,640 texture images is fairly expensive. To reduce this cost we propose to exploit the correlation and sparsity of the attribute occurrences (Fig. 3). For each attribute $q$, twelve key images are annotated exhaustively and used to estimate the probability $p(q'|q)$ that another attribute $q'$ could co-exist with $q$. Then, for the remaining key images of attribute $q$, only annotations for attributes $q'$ with non-negligible probability are collected, assuming that the remaining attributes would not apply. In practice, this requires annotating around 10 attributes per texture instance, instead of 47. This procedure occasionally misses attribute annotations; Fig. 3 evaluates attribute recall by 12-fold cross-validation on the 12 exhaustive annotations for a fixed budget of collecting 10 annotations per image.

[Figure: example images for each texture attribute. Labels surviving in this copy: banded, blotchy, braided, chequered, crosshatched, crystalline, dotted, freckled, grooved, honeycombed, lacelike, matted, meshed, paisley, potholed, smeared, stained, studded, swirly, veined, wrinkled.]

Fig. 2: The 47 texture words in the describable texture dataset introduced in this paper. Two examples of each attribute are shown to illustrate the significant amount of variability in the data.
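The co-occurrence step described above can be illustrated with a minimal sketch. It assumes the exhaustive annotations are stored as a binary image-by-attribute matrix, and it selects a fixed number of attributes per image rather than applying a probability threshold; the function names, the data layout, and the top-`budget` rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cooccurrence_prior(labels, key):
    """Estimate p(q'|q) from exhaustively annotated key images.

    labels: (n_images, n_attributes) binary matrix of exhaustive annotations.
    key:    (n_images,) index of the key attribute q of each image.
    Returns a matrix P with P[q, q'] = fraction of key images of q that
    were also annotated with q'.
    """
    labels = np.asarray(labels, dtype=float)
    key = np.asarray(key)
    n_attr = labels.shape[1]
    prior = np.zeros((n_attr, n_attr))
    for q in range(n_attr):
        rows = labels[key == q]
        if len(rows) > 0:
            prior[q] = rows.mean(axis=0)
    return prior


def attributes_to_annotate(prior, q, budget):
    """Return the `budget` attributes q' != q with the largest p(q'|q).

    Assumption: a fixed budget stands in for the 'non-negligible
    probability' criterion mentioned in the text.
    """
    scores = prior[q].copy()
    scores[q] = -np.inf  # the key attribute is already known
    order = np.argsort(-scores, kind="stable")
    return order[:budget].tolist()
```

With 12 exhaustively annotated images per attribute, `cooccurrence_prior` would be called once, and `attributes_to_annotate(prior, q, 10)` would then give the candidate list shown to annotators for the remaining key images of attribute `q`.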


A further refinement is to suggest which attributes $q'$ to annotate not just based on the prior $p(q'|q)$, but also based on the appearance of an image $x_i$. This was done by using the attribute classifier learned in Sect. [?]; after Platt’s calibration [?] on a held-out test set, the classifier score $c_{q'}(x_i) \in \mathbb{R}$ is transformed into a probability $p(q'|x_i) = \sigma(c_{q'}(x_i))$, where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function. By construction, Platt’s calibration reflects the prior probability $p(q') \approx p_0 = 1/47$ of $q'$ on the validation set. To reflect the probability $p(q'|q)$ instead, the score is adjusted as

$$ p(q'|x_i, q) \propto \sigma(c_{q'}(x_i)) \times \frac{p(q'|q)}{1 - p(q'|q)} \times \frac{1 - p_0}{p_0} $$

and used to find which attributes should be annotated for each image. As shown in Fig. 3, for a fixed annotation budget this method increases attribute recall.
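The prior adjustment of the calibrated classifier score can be sketched as follows. The sketch assumes the calibrated scores and the co-occurrence row $p(\cdot|q)$ are already available as arrays; the function names and the clipping constant `eps` are hypothetical additions for numerical safety and are not part of the original method.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def adjusted_scores(calibrated_scores, prior_q, p0=1.0 / 47.0, eps=1e-6):
    """Unnormalized p(q'|x_i, q) for every candidate attribute q'.

    calibrated_scores: Platt-calibrated classifier scores c_{q'}(x_i).
    prior_q:           co-occurrence probabilities p(q'|q) for the key
                       attribute q of the image.
    p0:                prior assumed by the calibration (1/47 in the text).
    eps:               clipping constant (an assumption) that keeps the
                       odds p/(1-p) finite.
    """
    p = np.clip(np.asarray(prior_q, dtype=float), eps, 1.0 - eps)
    odds_ratio = (p / (1.0 - p)) * ((1.0 - p0) / p0)
    return sigmoid(np.asarray(calibrated_scores, dtype=float)) * odds_ratio


def select_for_annotation(calibrated_scores, prior_q, budget):
    """Indices of the `budget` attributes with the largest adjusted score."""
    scores = adjusted_scores(calibrated_scores, prior_q)
    return np.argsort(-scores, kind="stable")[:budget].tolist()
```

When $p(q'|q)$ equals $p_0$ the correction factor is one and the calibrated probability is left unchanged, which matches the role of the last two factors in the formula.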

Overall, with roughly 10 annotations per image it was possible to recover all of the attributes for at least 75% of the images, and miss one out of four (on average) for another 20%, while keeping the annotation cost to a reasonable level. To put this in perspective, directly annotating the 5,640 images for 46 attributes and collecting five annotations per attribute would have required 1.2M binary annotations, i.e. roughly 12K USD at the very low rate of 1¢ per annotation. Using the proposed method, the cost would have been 546 USD. In practice, we spent around 2.5K USD in order to pay annotators better as well as to account for occasional errors in setting up experiments and the fact that, as explained above, bootstrapping still relies on exhaustive annotations for a subset of the data.
Fig. 3: Quality of sequential joint annotations. Left: each bar shows the average number of occurrences of a given attribute in a DTD image. The horizontal dashed line corresponds to a frequency of 1/47, the minimum given the design of DTD (Sect. 2.2). The black portion of each bar is the amount of attributes discovered by the sequential procedure, using only 10 annotations per image (about one fifth of the effort required for exhaustive annotation). The orange portion shows the additional recall obtained by integrating cross-validation in the process. Right: co-occurrence of attributes. The matrix shows the joint probability $p(q, q')$ of two attributes occurring together (rows and columns are sorted in the same way as the left image).


2.3 Benchmark tasks 

DTD is designed as a public benchmark. The data, includ¬ 
ing images, annotations, and splits, is available on the web at 
http://www.robots.ox.ac.uk/~vgg/data/dtd 
along with code for evaluation and reproducing the results 
in Sect. 12 

DTD defines two challenges. The first one, denoted 
DTD, is the prediction of key attributes, where each im¬ 
age is assigned a single label corresponding to the key at¬ 
tribute defined above. The second one, denoted DTD-J, is 
the joint prediction of multiple attributes. In this case each 
image is assigned one or more labels, corresponding to all 
the attributes that apply to that image. 

The first task is evaluated both in terms of classification accuracy (acc) and in terms of mean average precision (mAP), while the second task is evaluated only in terms of mAP due to the possibility of multiple labels. The classification accuracy is normalized per class: if $\hat{c}(x), c(x) \in \{1, \dots, C\}$ are respectively the predicted and ground-truth label of image $x$, accuracy is defined as

$$ \mathrm{acc}(\hat{c}) = \frac{1}{C} \sum_{c=1}^{C} \frac{|\{x : c(x) = c \wedge \hat{c}(x) = c\}|}{|\{x : c(x) = c\}|}. \qquad (1) $$

We define mAP as per the PASCAL VOC 2008 benchmark onward [?].¹

DTD contains 10 preset splits into equally-sized training, validation and test subsets for easier algorithm comparison. Results on any of the tasks are repeated for each split and average accuracies are reported.

¹ PASCAL VOC 2007 uses instead an interpolated version of mAP.


2.4 Attributes vs materials

As noted at the beginning of Sect. 2.1 and in [86], texture attributes and materials are correlated, but not equivalent. In this section we verify this quantitatively on the FMD data [87]. Specifically, we manually collected annotations for the 47 DTD attributes for the 1,000 images in the FMD dataset, which span ten different materials. Each of the 47 attributes was considered in turn, using a categorical random variable $C \in \{1, 2, \dots, 10\}$ to denote the texture material and a binary variable $A \in \{0, 1\}$ to indicate whether the attribute applies to the texture or not. On average, the relative reduction in the entropy of the material variable $I(A, C)/H(C)$ given the attribute is of about 14%; vice versa, the relative reduction in the entropy of the attribute variable $I(A, C)/H(A)$ given the material is just 0.5%. We conclude that knowing the material or attribute of a texture provides little information on the attribute or material, respectively. Note that combinations of attributes can predict materials much more reliably, although this is difficult to quantify from a small dataset.
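The two relative entropy reductions used in Sect. 2.4, $I(A, C)/H(C)$ and $I(A, C)/H(A)$, can be computed from a contingency table of attribute value against material. The sketch below is only an illustration under assumptions: the table layout (rows index the binary attribute, columns index the material) and the function names are hypothetical, and no FMD counts are reproduced here.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def relative_entropy_reduction(joint_counts):
    """Return (I(A;C)/H(C), I(A;C)/H(A)) from a count table.

    joint_counts[a, c] is the number of images with attribute value a
    (0 or 1) and material c. Both marginals are assumed to be
    non-degenerate, so that H(A) and H(C) are strictly positive.
    """
    joint = np.asarray(joint_counts, dtype=float)
    joint = joint / joint.sum()
    h_a = entropy(joint.sum(axis=1))  # entropy of the attribute variable
    h_c = entropy(joint.sum(axis=0))  # entropy of the material variable
    mutual_info = h_a + h_c - entropy(joint.ravel())
    return mutual_info / h_c, mutual_info / h_a
```

Averaging the two returned ratios over the 47 attributes would give quantities of the kind quoted in the text.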


2.5 Related work 


This section relates DTD to the literature in texture understanding. Textures, due to their ubiquity and complementarity to other visual properties such as shape, have been studied in several contexts: texture perception [?], description [34], material recognition [?], segmentation [?], synthesis [?], and shape from texture [?]. Most related to DTD is the work on texture recognition, summarized below as the recognition of perceptual properties (Sect. 2.5.1) and recognition of identities and materials (Sect. 2.5.2).
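|          | Size                       | Condition                   | Content                          |                            |
| Dataset  | Images | Classes | Splits | Wild | Clutter | Controlled | Attributes | Materials | Objects | (I)nstances / (C)ategories |
|----------|--------|---------|--------|------|---------|------------|------------|-----------|---------|----------------------------|
| Brodatz  | 999    | 111     | -      |      |         | X          |            |           | X       | I                          |
| CUReT    | 5612   | 61      | 10     |      |         | X          |            | X         |         | I                          |
| UIUC     | 1000   | 25      | 10     |      |         | X          |            | X         |         | I                          |
| UMD      | 1000   | 25      | 10     | X    |         |            |            |           | X       | I                          |
| KTH      | 810    | 11      | 10     |      |         | X          |            | X         |         | I                          |
| Outex    | -      | -       | -      |      |         | X          |            | X         | X       | I                          |
| Drexel   | ~40000 | 20      | -      |      |         | X          |            | X         |         | I                          |
| ALOT     | 25000  | 250     | 10     |      |         | X          |            | X         |         | I                          |
| FMD      | 1000   | 10      | 14     | X    |         |            |            | X         |         | C                          |
| KTH-T2b  | 4752   | 11      |        |      |         | X          |            | X         |         | C                          |
| DTD      | 5640   | 47      | 10     | X    |         |            | X          |           |         | C                          |
| OS       | 10422  | 22      | 1      | X    | X       |            | X(+A)      | X         |         | C                          |

Table 1: Comparison of existing texture datasets, in terms of size, collection condition, nature of the classes to be recognized, and whether each class includes a single object/material instance or several instances of the same category. Note that Outex is a meta-collection of textures spanning different datasets and problems.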


2.5.1 Recognition of perceptual properties 

The study of perceptual properties of textures originated in computer vision as well as in cognitive sciences. Some of the earliest work on texture perception conducted by Julesz [?] focussed on pre-attentive aspects of perception. It led to the concept of “textons,” primitives such as line-terminators, crossings, intersections, etc., that are responsible for pre-attentive discrimination of textures. In computer vision, Tamura et al. [?] identified six common directions of variability of images in the Brodatz dataset: coarse vs. fine; high-contrast vs. low-contrast; directional vs. non-directional; linelike vs. bloblike; regular vs. irregular; and rough vs. smooth. Similar perceptual attributes of texture [?] have been found by other researchers.

Our work is motivated by that of Bhushan et al. [?]. Their experiments suggest that there is a strong correlation between the structure of the lexical space and perceptual properties of texture. While they studied the psychological aspects of texture perception, the focus of this paper is the challenge of estimating such properties from images automatically. Their work [?], in particular, identified a set of words sufficient to describe a wide variety of texture patterns; the same set of words was used to bootstrap DTD.

While recent work in computer vision has been focussed on texture identification and material recognition, notable contributions to the recognition of perceptual properties exist. Most of this work is part of the general research on visual attributes [?]. Texture attributes have an important role in describing objects, particularly for those that are best characterized by a pattern, such as items of clothing and parts of animals such as birds. Notably, the first work on modern visual attributes by Ferrari et al. [?] focused on the recognition of a few perceptual properties of textures. Later work, such as [?], which mined visual attributes from images on the Internet, also contains some attributes that describe textures. Nevertheless, so far the attributes of textures have been investigated only tangentially. DTD addresses the question of whether there exists a “universal” set of attributes that can describe a wide range of texture patterns, whether these can be reliably estimated from images, and for what tasks they are useful.

[Figure: example images under two panel titles, “which material instance?” (Brodatz, CUReT) and “which material category?” (KTH-TIPS, Flickr MD); image labels: sample 1, sample 35, bread, foliage.]

Fig. 4: Datasets such as Brodatz [?] and CUReT [?] (left) addressed the problem of material instance identification and others such as KTH-T2b [42] and FMD [87] (right) addressed the problem of material category recognition. Our DTD dataset addresses a very different problem: the one of describing a pattern using intuitive attributes (Fig. 2).

Datasets that focus on the recognition of subjective properties of textures are less common. One example is Pertex [25], containing 300 texture images taken in a controlled setting (Lambertian renderings of 3D reconstructions of real materials) as well as a semantic similarity matrix obtained from human similarity judgments. The work most related to ours is probably the one of [?], which analyzed images in the Outex dataset [69] using a subset of the texture attributes that we consider. DTD differs in scope (containing more attributes) and, especially, in the nature of the data (controlled vs uncontrolled conditions). In particular, working in uncontrolled conditions allows us to transfer the texture attributes to real-world applications, including material recognition in the wild and in clutter, as shown in the experiments.
2.5.2 Recognition of texture instances and material categories

Most of the recent work in texture recognition focuses on the recognition of texture instances and material categories, as reflected by the development of corresponding benchmarks (Fig. 4). The Brodatz [?] catalogue was used in early works on textures to study the problem of identifying texture instances (e.g. matching half of the texture image given the other half). Others including CUReT [?], UIUC [?], KTH-TIPS [?], Outex [69], Drexel Texture Database [?], and ALOT [?] address the recognition of specific instances of one or more materials. UMD [?] is similar, but the imaged objects are not necessarily composed of a single material. As textures are imaged under variable truncation, viewpoint, and illumination, these datasets have stimulated the creation of texture representations that are invariant to viewpoint and illumination changes [57, 69, 95, ?]. Frequently, texture understanding is formulated as the problem of recognizing the material of an object rather than a particular texture instance (in this case any two slabs of marble would be considered equal). KTH-T2b [?] is one of the first datasets to address this problem by grouping textures not only by the instance, but also by the type of materials (e.g. “wood”).

However, these datasets make the simplifying assumption that textures fill images, and often there is limited intra-class variability, due to a single or limited number of instances, captured under controlled scale, view-angle and illumination. Thus, they are not representative of the problem of recognizing materials in natural images, where textures appear under poor viewing conditions, low resolution, and in clutter. Addressing this limitation is the main goal of the Flickr Material Database (FMD) [87]. FMD samples just one viewpoint and illumination per object, but contains many different object instances grouped in several different material classes. Sect. 3 will introduce datasets addressing the problem of clutter as well.

The performance of recognition algorithms on most of this data is close to perfect, with classification accuracies well above 95%; KTH-T2b and FMD are an exception due to their increased complexity. A review of these datasets and classification methodologies is presented in [94], who also propose a training-free framework to classify textures, significantly improving on other methods. Table 1 and Fig. 4 provide a summary of the nature and size of various texture datasets that are used in our experiments.

3 Recognizing textures in clutter 

This section looks at the second contribution of the paper, namely studying the recognition of materials and describable texture attributes not only “in the wild,” but also “in clutter”. Even in datasets such as FMD and DTD, in fact, each texture instance fills the entire image, which does not match most applications. This section removes this limitation and looks at the problem of recognizing textures imaged in the larger context of a complex natural scene, including the challenging task of automatically segmenting textured image regions.

Rather than collecting a new image dataset from scratch, our starting point is the excellent OpenSurfaces (OS) dataset that was recently introduced by Bell et al. [8]. OS comprises 25,357 images, each containing a number of high-quality texture/material segments. Many of these segments are annotated with additional attributes such as the material, viewpoint, BRDF estimates, and object class. Experiments focus on the 58,928 segments that contain material annotations. Since material classes are highly unbalanced, we consider only the materials that contain at least 400 examples. This results in 53,915 annotated material segments in 10,422 images spanning 23 different classes.² Images are split evenly into training, validation, and test subsets with 3,474 images each. Segment sizes are highly variable, with half of them being relatively small, with an area smaller than 64 × 64 pixels. One issue with crowdsourced collection of segmentations is that not all the pixels in an image are labelled. This makes it difficult to define a complete background class. For our benchmark several less common materials (including for example segments that annotators could not assign to a material) were merged in an “other” class that acts as the background.

This benchmark is similar to the one concurrently proposed by Bell et al. [10]. However, in order to study perceptual properties as well as materials, we also augment the OS dataset with some of the describable attributes of Sect. 2. Since the OS segments do not trigger with sufficient frequency all the 47 attributes, the evaluation is restricted to eleven of them for which it was possible to identify at least 100 matching segments.³ The attributes were manually labelled in the 53,915 segments retained for materials. We refer to this data as OSA.

3.1 Benchmark tasks 

As for DTD, the aim is to define standardized image understanding tasks to be used as public benchmarks. The complete list of images, segments, labels, and splits is publicly available at http://www.robots.ox.ac.uk/~vgg/data/wildtex/.

² The classes and corresponding number of example segments are: brick (610), cardboard (423), carpet/rug (1,975), ceramic (1,643), concrete (567), fabric/cloth (7,484), food (1,461), glass (4,571), granite/marble (1,596), hair (443), other (2,035), laminate (510), leather (957), metal (4,941), painted (7,870), paper/tissue (1,226), plastic/clear (586), plastic/opaque (1,800), stone (417), tile (3,085), wallpaper (483), wood (9,232).

³ These are: banded, blotchy, checkered, flecked, gauzy, grid, marbled, paisley, pleated, stratified, wrinkled.

The benchmarks include two tasks on two complementary semantic domains. The first task is the recognition of texture regions, given the region extent as ground truth information. This task is instantiated for both materials, denoted OS+R, and describable texture attributes, denoted OSA+R. Performance in OS+R is measured in terms of classification accuracy and mAP, using the same definition (1) where images are replaced by image regions. Performance in OSA+R uses instead mAP due to the possibility of multiple labels.

The second task is the segmentation and recognition of
texture regions, which we also instantiate for materials (OS)
and describable texture attributes (OSA). Since not all image
pixels are labelled in the ground truth, the performance of a
predictor $\hat{c}$ is measured in terms of per-pixel classification
accuracy, pp-acc($\hat{c}$). This is computed using the same formula
as (1) with two modifications: first, the images $x$ are replaced
by pixels $p$ (extracted from all images in the dataset); second,
the ground truth label $c(p)$ of a pixel may take an additional
value 0 to denote pixels that are not labelled in the
ground truth (the effect is to ignore them in the computation
of accuracy).

In the case of OSA, the per-pixel accuracy is modified
such that a class prediction is considered correct if it belongs
to any of the ground-truth pixel labels. Furthermore,
accuracy is not normalized per class as this is ill-defined, but
by the total number of pixels:

$$\text{acc-osa}(\hat{c}) = \frac{|\{p : \hat{c}(p) \in c(p)\}|}{|\{p : c(p) \neq \emptyset\}|} \qquad (2)$$

where $\hat{c}(p)$ is the label predicted for pixel $p$, $c(p)$ is the set
of possible labels of pixel $p$, and $\emptyset$ denotes the empty set.
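As an illustration of Eq. (2), the following Python sketch computes the OSA per-pixel accuracy from a list of predicted labels and a list of ground-truth label sets. The function name `acc_osa`, the list-based data layout, and the toy labels are assumptions made for this example; they are not part of the benchmark code.

```python
def acc_osa(predictions, label_sets):
    """Fraction of labelled pixels whose predicted label is in the ground-truth set.

    predictions: one predicted label per pixel.
    label_sets:  one set of admissible labels per pixel; an empty set marks
                 an unlabelled pixel, which is ignored (denominator of Eq. 2).
    """
    labelled = [(pred, labels) for pred, labels in zip(predictions, label_sets) if labels]
    if not labelled:
        raise ValueError("no labelled pixels")
    correct = sum(1 for pred, labels in labelled if pred in labels)
    return correct / len(labelled)


# Toy example (hypothetical data): four pixels, the third one is unlabelled.
predictions = ["banded", "grid", "wrinkled", "banded"]
label_sets = [{"banded", "pleated"}, {"marbled"}, set(), {"banded"}]
accuracy = acc_osa(predictions, label_sets)
print(accuracy)  # two correct predictions out of three labelled pixels
```

Unlabelled pixels are dropped before counting, so they affect neither the numerator nor the denominator.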


4 Texture representations 

Having presented our contributions to framing the problem
of texture description, we now turn to our technical advances
towards addressing the resulting problems. We start by revisiting
the concept of texture representation and studying
how it relates to modern image descriptors based on CNNs.
In general, a visual representation is a map that takes an
image $x$ to a vector $\phi(x) \in \mathbb{R}^d$ that facilitates understanding
the image content. Understanding is often achieved by
learning a linear predictor $\langle \phi(x), w \rangle$ scoring the strength
of association between the image and a particular concept,
such as an object category.

Among image representations, this paper is particularly 
interested in the class of texture representations pioneered 
by the works of ||l5]|57]|6l|Ml- Textures encompass a large 
diversity of visual patterns, from regular repetitions such as 
wallpapers, to stochastic processes such as fur, to intermedi¬ 
ate cases such as pebbles. Distortions due to viewpoint and 


other imaging factors further complicate modeling textures. 
However, one can usually assume that, given a particular 
texture, appearance variations are statistically independent 
in the long range and can therefore be eliminated by aver¬ 
aging local image statistics over a sufficiently large texture 
sample. Hence, the defining characteristic of texture repre¬ 
sentations is to pool information extracted locally and uni¬ 
formly from the image, by means of local descriptors, in an 
orderless manner.

The importance of texture representations is in the fact 
that they were found to be applicable well beyond textures. 
For example, until recently many of the best object catego¬ 
rization methods in challenges such as PASCAL VOC 1^ 
and ImageNet ILSVRC ESll were based on variants of tex¬ 
ture representations, developed specifically for objects. One 
of the contributions of this work is to show that these object- 
optimized texture representations are in fact optimal for a 
large number of texture-specific problems too (Sect. 6.1.3).

More recently, texture representations have been sig¬ 
nificantly outperformed by Convolutional Neural Networks 
(CNNs) in object categorization ED, detection seg¬ 
mentation ED, and in fact in almost all domains of image 
understanding. Key to the success of CNNs is their ability 
to leverage large labelled datasets to learn high-quality fea¬ 
tures. Importantly, CNN features pre-trained on very large 
datasets were found to transfer to many other domains with 
a relatively modest adaptation effort ll20l l39l WT\ ITOI [84ll . 
Hence, CNNs provide general-purpose image descriptors. 

While CNNs generally outperform classical texture rep¬ 
resentations, it is interesting to ask what is the relation be¬ 
tween these two methods and whether they can be fruit¬ 
fully hybridized. Standard CNN-based methods such as lIMl 
[391 in |2Q1 El can be interpreted as extracting local image
descriptors (performed by the so-called “convolutional
layers”) followed by pooling such features in a global
image representation (performed by the “Fully-Connected 
(FC) layers”). Here we will show that replacing FC pool¬ 
ing with one of the many pooling mechanisms developed in 
texture representations has several advantages: (i) a much 
faster computation of the representation for image subre¬ 
gions accelerating applications such as detection and seg¬ 
mentation EUSollll], (ii) a significantly superior recog¬ 
nition accuracy in several application domains and (iii) the 
ability of achieving this superior performance without fine- 
tuning CNNs by implicitly reducing the domain shift prob¬ 
lem. 

In order to systematically study variants of texture representations
$\phi = \phi_e \circ \phi_f$, we break them into local descriptor
extraction $\phi_f$ followed by descriptor pooling $\phi_e$.
In this manner, different combinations of each component
can be evaluated. Common local descriptors include linear 
filters, local image patches, local binary patterns, densely- 
extracted SIFT features, and many others. Since local descriptors
are extracted uniformly from the image, they can
be seen as banks of (non-linear) filters; we therefore refer
to them as filter banks in honor of the pioneering works
of ITSl [3^ |57l l64l and others where descriptors were the
output of actual linear filters. Pooling methods include bag-
of-visual-words, variants using soft-assignment, or extract¬ 
ing higher-order statistics as in the Fisher vector. Since these 
methods encode the information contained in the local de¬ 
scriptors in a single vector, we refer to them as pooling en¬ 
coders. 

Sect. 4.1 and Sect. 4.2 discuss filter banks and pooling
encoders in detail. 

4.1 Local image descriptors

There is a vast choice of local image descriptors in texture
representations. Traditionally, these features were hand-crafted,
but with the latest generation of deep learning methods
it is now customary to learn them from data (although
often in an implicit form). Representative examples of these
two families of local features are discussed in Sect. 4.1.1 and
Sect. 4.1.2 respectively.

4.1.1 Hand-crafted local descriptors

Some of the earliest local image descriptors were developed
as linear filter banks in texture recognition. As an evolution
of earlier texture filters ifTSl l62]|, the filter bank of Leung
and Malik (LM) 1581 includes 48 filters matching bars, edges
and spots, at various scales and orientations. These filters
are first and second derivatives of Gaussians at 6 orientations
and 3 scales (36), 8 Laplacian of Gaussian (LOG) filters, and
4 Gaussians. Combinations of the filter responses, identified
by vector quantisation (Sect. 4.2.1), were used as the computational
basis of the “textons” proposed by Julesz ll49ll. The
filter bank MR8 of lIMl l95l consists instead of 38 filters,
similar to LM. For each of the two oriented filter types, only the
maximum response across orientations is recorded at each scale,
reducing the number of responses to 8 (3 scales for two oriented
filters, and two isotropic filters, Gaussian and Laplacian of Gaussian).

The importance of using linear filters as local features
was later questioned by Varma and Zisserman ll95l. The VZ
descriptors are in fact small image patches which, remarkably,
were shown to outperform LM and MR8 on earlier
texture benchmarks such as CuRET. However, as will be
demonstrated in the experiments, trivial local descriptors are
not competitive in harder tasks.

Another early local image descriptor is the Local Binary
Pattern (LBP) of 16811^, a special case of the texture
units of n03l. An LBP $d_i = (b_1, \dots, b_m)$ computed at a
pixel $p_i$ is the sequence of bits $b_j = [x(p_i) > x(p_j)]$ comparing
the intensity $x(p_i)$ of the central pixel to that of $m$
neighbors $p_j$ (usually 8 in a circle). LBPs have specialized
quantization schemes; the most common one maps the bit
string $d_i$ to one of a number of uniform patterns ll69l. The
quantized LBPs can be averaged over the image to build a
histogram; alternatively, such histograms can be computed
for small image patches and used in turn as local image descriptors.

In the context of object recognition, the best known local
descriptor is undoubtedly D. Lowe’s SIFT lIMl. SIFT is the
histogram of the occurrences of image gradients quantized
with respect to their location within a patch as well as
their orientation. While SIFT was originally introduced to match
object instances, it was later applied to an impressive diversity
of tasks, from object categorization to semantic segmentation
and face recognition.
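The LBP construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, the 8 neighbors are taken on the 3 × 3 square ring around the pixel instead of an interpolated circle, and "uniform" is taken in its usual sense of at most two 0/1 transitions around the ring.

```python
def lbp_code(image, row, col):
    """Bits b_j = [x(p_i) > x(p_j)] for the 8 neighbors of pixel (row, col).

    Assumption: neighbors are read clockwise on the 3x3 ring, starting
    from the top-left pixel (no circular interpolation).
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    center = image[row][col]
    return tuple(int(center > image[row + dr][col + dc]) for dr, dc in offsets)


def is_uniform(bits):
    """True if the circular bit string has at most two 0/1 transitions."""
    m = len(bits)
    transitions = sum(bits[j] != bits[(j + 1) % m] for j in range(m))
    return transitions <= 2


# Toy 3x3 image (hypothetical intensities); the LBP is computed at the center.
image = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
bits = lbp_code(image, 1, 1)
print(bits, is_uniform(bits))
```

With 8 neighbors there are 58 uniform bit strings under this definition, which agrees with the 58-dimensional LBP descriptor used in the experiments of Sect. 6.1.3.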


4.1.2 Learned local descriptors 

Handcrafted image descriptors are nowadays outperformed
by features learned using the latest generation of deep CNNs ED.
A CNN can be seen as a composition $\phi_K \circ \dots \circ \phi_2 \circ \phi_1$
of $K$ functions or layers. The output of each layer
$x_k = (\phi_k \circ \dots \circ \phi_2 \circ \phi_1)(x)$ is a descriptor field
$x_k \in \mathbb{R}^{W_k \times H_k \times D_k}$,
where $W_k$ and $H_k$ are the width and height of the field and
$D_k$ is the number of feature channels. By collecting the
$D_k$ responses at a certain spatial location, one obtains a
$D_k$-dimensional descriptor vector. The network is called convolutional
if all the layers are implemented as (non-linear)
filters, in the sense that they act locally and uniformly on
their input. If this is the case, since compositions of filters
are filters, the feature field $x_k$ is the result of applying a
non-linear filter bank to the image $x$.

As computation progresses, the resolution of the descriptor
fields decreases whereas the number of feature channels
increases. Often, the last several layers $\phi_k$ of a CNN
are called “fully connected” because, if seen as filters, their
support is the same as the size of the input field $x_{k-1}$ and
they therefore lack locality. By contrast, earlier layers that act
locally will be referred to as “convolutional”. If there are $C$
convolutional layers, the CNN $\phi = \phi_e \circ \phi_f$ can be decomposed
into a filter bank (local descriptors) $\phi_f = \phi_C \circ \dots \circ \phi_1$
followed by a pooling encoder $\phi_e = \phi_K \circ \dots \circ \phi_{C+1}$.


4.2 Pooling encoders 


A pooling encoder takes as input the local descriptors extracted
from an image $x$ and produces as output a single
feature vector $\phi(x)$, suitable for tasks such as classification
with an SVM. A first important differentiating factor
between encoders is whether they discard the spatial configuration
of input features (orderless pooling; Sect. 4.2.1) or
whether they retain it (order-sensitive pooling; Sect. 4.2.2).
A detail of practical importance, furthermore, is the type of
post-processing applied to the pooled vectors (post-processing;
Sect. 4.2.3).

4.2.1 Orderless pooling encoders 

An orderless pooling encoder $\phi_e$ maps a sequence $\mathcal{F} =
(f_1, \dots, f_n)$, $f_i \in \mathbb{R}^D$, of local image descriptors to a feature
vector $\phi_e(\mathcal{F}) \in \mathbb{R}^d$. The encoder is orderless in the sense
that the function $\phi_e$ is invariant to permutations of the input
$\mathcal{F}$.^ Furthermore, the encoder can be applied to any number
of features; for example, the encoder can be applied to the
sub-sequence $\mathcal{F}' \subset \mathcal{F}$ of local descriptors contained in a target
image region without recomputing the local descriptors
themselves.

All common orderless encoders are obtained by applying
a non-linear descriptor encoder $\eta(f_i) \in \mathbb{R}^d$ to individual
local descriptors and then aggregating the result by using a
commutative operator such as average or max. For example,
average-pooling yields $\bar{\phi}_e(\mathcal{F}) = \frac{1}{n} \sum_{i=1}^{n} \eta(f_i)$. The pooled
vector $\bar{\phi}_e(\mathcal{F})$ is post-processed to obtain the final representation
$\phi_e(\mathcal{F})$ as discussed later.

The best-known orderless encoder is the Bag of Visual
Words (BoVW). This encoder starts by vector-quantizing
(VQ) the local features $f_i \in \mathbb{R}^D$ by assigning them to their
closest visual word in a dictionary $C = [c_1 \dots c_d] \in \mathbb{R}^{D \times d}$
of $d$ elements. Visual words can be thought of as
“prototype features” and are obtained during training by
clustering example local features. The descriptor encoder
$\eta_1(f_i)$ is the one-hot vector indicating the visual word corresponding
to $f_i$, and average-pooling these one-hot vectors
yields the histogram of visual word occurrences. BoVW
was introduced in the work of 1581 to characterize the distribution
of textons, defined as configurations of local filter
responses, and then ported to object instance and category
understanding by 1971 and f2E\ respectively. It was then extended
in several ways as described below.
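The BoVW encoder just described (hard assignment followed by average pooling) can be sketched as follows. The function names, the use of squared Euclidean distance for the nearest-word search, and the toy dictionary are assumptions of this example; in practice the dictionary would be learned by clustering training descriptors.

```python
def nearest_word(f, dictionary):
    """Index of the visual word closest to descriptor f (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(dictionary)), key=lambda j: sqdist(f, dictionary[j]))


def bovw_encode(descriptors, dictionary):
    """Average-pool the one-hot assignments, giving a histogram of visual words."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    hist = [0.0] * len(dictionary)
    for f in descriptors:
        hist[nearest_word(f, dictionary)] += 1.0
    n = len(descriptors)
    return [h / n for h in hist]


# Toy example (hypothetical values): three visual words, four 2-D descriptors.
dictionary = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
descriptors = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.5, 0.2]]
histogram = bovw_encode(descriptors, dictionary)
print(histogram)
```

Reordering the descriptors leaves the histogram unchanged, which is the orderless property discussed at the start of this section.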

The kernel codebook encoder ||78l assigns each local
feature to several visual words, weighted by a degree of
membership: $[\eta_{\mathrm{KC}}(f_i)]_j \propto \exp(-\lambda \|f_i - c_j\|^2)$, where $\lambda$
is a parameter controlling the locality of the assignment.
The descriptor code $\eta_{\mathrm{KC}}(f_i)$ is $\ell^1$ normalized before aggregation,
such that $\|\eta_{\mathrm{KC}}(f_i)\|_1 = 1$. Several related methods
used concepts from sparse coding to define the local
descriptor encoder 159111101. Locality constrained Linear
Coding (LLC) ITO^, in particular, extends soft assignment
by making the assignments reconstructive, local, and sparse:
the descriptor encoder $\eta_{\mathrm{LLC}}(f_i) \in \mathbb{R}^d_+$, $\|\eta_{\mathrm{LLC}}(f_i)\|_1 = 1$,
$\|\eta_{\mathrm{LLC}}(f_i)\|_0 \le r$ is computed such that $f_i \approx C \eta_{\mathrm{LLC}}(f_i)$
while allowing only the $r \ll d$ visual words closest to $f_i$
to have a non-zero coefficient.

^ Note that $\mathcal{F}$ cannot be represented as a set as encoders are generally
sensitive to repetitions of feature descriptors. It could be defined
as a multiset or, as done here, as a sequence $\mathcal{F}$.


In the Vector of Locally-Aggregated Descriptors (VLAD)
i46i the descriptor encoder is richer. Local image descriptors
are first assigned to their nearest neighbor visual word in a
dictionary of $K$ elements like in BoVW; then the descriptor
encoder is given by $\eta_{\mathrm{VLAD}}(f_i) = (f_i - C \eta_1(f_i)) \otimes \eta_1(f_i)$,
where $\otimes$ is the Kronecker product. Intuitively, this subtracts
from $f_i$ the corresponding visual word $C \eta_1(f_i)$ and then
copies the difference into one of $K$ possible subvectors, one
for each visual word. Hence average-pooling $\eta_{\mathrm{VLAD}}(f_i)$ accumulates
first-order descriptor statistics instead of simple
occurrences as in BoVW.
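A minimal sketch of the VLAD encoder described above is given below. The function names and the toy data are hypothetical, and the output is laid out as $K$ consecutive subvectors of size $D$ (one per visual word); this ordering of the entries is an assumption of the example and may differ from the Kronecker-product ordering by a permutation.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vlad_encode(descriptors, dictionary):
    """Average-pool the residuals f_i - c_k into the subvector of the nearest word k."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    K, D = len(dictionary), len(dictionary[0])
    pooled = [[0.0] * D for _ in range(K)]
    for f in descriptors:
        k = min(range(K), key=lambda j: sqdist(f, dictionary[j]))
        for t in range(D):
            pooled[k][t] += f[t] - dictionary[k][t]
    n = len(descriptors)
    # Flatten to a single vector of K * D values.
    return [v / n for sub in pooled for v in sub]


# Toy example (hypothetical values): K = 2 visual words, D = 2.
dictionary = [[0.0, 0.0], [4.0, 0.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [4.0, -2.0]]
vlad = vlad_encode(descriptors, dictionary)
print(vlad)
```

The post-processing of Sect. 4.2.3 (normalization of the subvectors and of the whole vector) is deliberately left out of this sketch.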

VLAD can be seen as a variant of the Fisher Vector
(FV) 1751. The FV differs from VLAD as follows. First,
the quantizer is not $K$-means but a Gaussian Mixture Model
(GMM) with components $(\pi_k, \mu_k, \Sigma_k)$, $k = 1, \dots, K$, where
$\pi_k \in \mathbb{R}$ is the prior probability of the component, $\mu_k \in \mathbb{R}^D$
the Gaussian mean and $\Sigma_k \in \mathbb{R}^{D \times D}$ the Gaussian covariance
(assumed diagonal). Second, hard-assignments $\eta_1(f_i)$
are replaced by soft-assignments $\eta_{\mathrm{GMM}}(f_i)$ given by the posterior
probability of each GMM component. Third, the FV
descriptor encoder $\eta_{\mathrm{FV}}(f_i)$ includes both first-order $\Sigma_k^{-1/2}(f_i - \mu_k)$
and second-order $\Sigma_k^{-1}(f_i - \mu_k) \odot (f_i - \mu_k) - 1$ statistics,
weighted by $\eta_{\mathrm{GMM}}(f_i)$ (see 119117511771 for details). Hence,
average pooling $\eta_{\mathrm{FV}}(f_i)$ accumulates both first and second
order statistics of the local image descriptors.
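The three ingredients of the FV listed above (GMM posteriors, first-order and second-order statistics, average pooling) can be sketched as follows for a diagonal-covariance GMM. This is a simplified illustration under stated assumptions: the function names are hypothetical, the GMM parameters are given rather than learned, and the prior-dependent scaling constants found in the cited references are omitted.

```python
import math


def gmm_posteriors(f, priors, means, variances):
    """Posterior probability of each diagonal Gaussian component given descriptor f."""
    logs = []
    for pi_k, mu_k, var_k in zip(priors, means, variances):
        log_p = math.log(pi_k)
        for x, m, v in zip(f, mu_k, var_k):
            log_p += -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v
        logs.append(log_p)
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]


def fv_encode(descriptors, priors, means, variances):
    """Average-pooled first- and second-order statistics, weighted by the posteriors."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    K, D = len(means), len(means[0])
    first = [[0.0] * D for _ in range(K)]
    second = [[0.0] * D for _ in range(K)]
    for f in descriptors:
        q = gmm_posteriors(f, priors, means, variances)
        for k in range(K):
            for t in range(D):
                diff = f[t] - means[k][t]
                first[k][t] += q[k] * diff / math.sqrt(variances[k][t])
                second[k][t] += q[k] * (diff * diff / variances[k][t] - 1.0)
    n = len(descriptors)
    # First-order blocks followed by second-order blocks: 2 * K * D values.
    return [v / n for block in first + second for v in block]


# Toy example (hypothetical values): one standard Gaussian in one dimension.
fv = fv_encode([[2.0], [0.0]], priors=[1.0], means=[[0.0]], variances=[[1.0]])
print(fv)
```

The output has $2KD$ entries, in agreement with the dimensionality stated in Sect. 6.1.1.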

All the encoders discussed above use average pooling, 
except LLC that uses max pooling. 

4.2.2 Order-sensitive pooling encoders 

An order-sensitive encoder differs from an orderless encoder
in that the map $\phi_e$ is not invariant to permutations
of the input $\mathcal{F}$. Such an encoder can therefore reflect the
layout of the local image descriptors, which may be ineffective
or even counter-productive in texture recognition, but
is usually helpful in the recognition of objects, scenes, and
others.

The most common order-sensitive encoder is the
Spatial Pyramid Pooling (SPP) of l55l. SPP transforms
any orderless encoder into one with (weak) spatial sensitivity
by dividing the image into subregions, computing the
encoder for each subregion, and stacking the results. This
encoder is only sensitive to reassignments of the local
descriptors to different subregions.
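The spatial pooling idea can be illustrated with the sketch below, which applies an orderless encoder to each cell of a regular grid and stacks the results. It is a simplification under stated assumptions: a single grid level is used (a full pyramid would stack several grids of different resolution), the names are hypothetical, and plain mean pooling stands in for the orderless encoder.

```python
def mean_pool(descriptors, dim):
    """Orderless encoder used for illustration: the mean descriptor (zeros if empty)."""
    if not descriptors:
        return [0.0] * dim
    n = len(descriptors)
    return [sum(f[t] for f in descriptors) / n for t in range(dim)]


def spatial_pooling(located, width, height, grid, dim, encoder=mean_pool):
    """Stack the encodings of the descriptors falling in each cell of a grid x grid layout.

    located: list of ((x, y), descriptor) pairs with 0 <= x < width, 0 <= y < height.
    """
    cells = [[] for _ in range(grid * grid)]
    for (x, y), f in located:
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        cells[row * grid + col].append(f)
    stacked = []
    for cell in cells:
        stacked.extend(encoder(cell, dim))
    return stacked


# Toy example (hypothetical values): a 4 x 4 image, 2 x 2 grid, 1-D descriptors.
located = [((0.5, 0.5), [1.0]), ((3.5, 0.5), [2.0]),
           ((0.5, 3.5), [3.0]), ((3.5, 3.5), [4.0])]
pooled = spatial_pooling(located, width=4, height=4, grid=2, dim=1)
print(pooled)
```

Permuting descriptors inside a cell leaves the output unchanged, while moving a descriptor to another cell changes it, which is the weak spatial sensitivity described above.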

The Fully-Connected layers (FC) in a CNN also form 
an order-sensitive encoder. Compared to the encoders seen 
above, FC are pre-trained discriminatively, which can be ei¬ 
ther an advantage or disadvantage, depending on whether 
the information that they captured can be transferred to the 
domain of interest. FC poolers are much less flexible than 
the encoders seen above as they work only with a particular 
type of local descriptors, namely the corresponding CNN 
convolutional layers. Furthermore, a standard FC pooler can
only operate on a well defined layout of local descriptors
(e.g. 6 × 6), which in turn means that the image needs to be
resized to a standard size before the FC encoder can be eval¬ 
uated. This is particularly expensive when, as in object de¬ 
tection or image segmentation, many image subregions must 
be considered. 


4.2.3 Post-processing

The vector $y = \bar{\phi}_e(\mathcal{F})$ obtained by pooling local image descriptors
is usually post-processed before being used in a
classifier. In the simplest case, this amounts to performing
$\ell^2$ normalization $\phi_e(\mathcal{F}) = y / \|y\|_2$. However, this is usually
preceded by a non-linear transformation $\phi_K(y)$ which
is best understood in terms of kernels. A kernel $K(y', y'')$
specifies a notion of similarity between data points $y'$ and
$y''$. If $K$ is a positive semidefinite function, then it can always
be rewritten as the inner product $\langle \phi_K(y'), \phi_K(y'') \rangle$
where $\phi_K$ is a suitable pre-processing function called a kernel
embedding MM. Typical kernels include the linear,
Hellinger’s, additive-$\chi^2$, and exponential-$\chi^2$ ones, given respectively
by:

$$\langle y', y'' \rangle, \qquad \sum_{i=1}^{d} \sqrt{y'_i y''_i}, \qquad \sum_{i=1}^{d} \frac{2 y'_i y''_i}{y'_i + y''_i}, \qquad \exp\left[-\lambda \sum_{i=1}^{d} \frac{(y'_i - y''_i)^2}{y'_i + y''_i}\right].$$

In practice, the kernel embedding $\phi_K$ can be computed easily
only in a few cases, including the linear kernel ($\phi_K$ is the
identity) and Hellinger’s kernel (for each scalar component,
$\phi_{\mathrm{Hell.}}(y) = \sqrt{y}$). In the latter case, if $y$ can take negative values,
then the embedding is extended to the so-called signed
square root $\phi_{\mathrm{Hell.}}(y) = \mathrm{sign}(y) \sqrt{|y|}$.^

Even if $\phi_K$ is not explicitly computed, any kernel can
be used to learn a classifier such as an SVM (kernel trick).
In this case, $\ell^2$ normalizing the kernel embedding $\phi_K(y)$
amounts to normalizing the kernel as

$$K'(y', y'') = \frac{K(y', y'')}{\sqrt{K(y', y') K(y'', y'')}}.$$

All the pooling encoders discussed above are usually followed
by post-processing. In particular, the Improved Fisher
Vector (IFV) llTTlI prescribes the use of the signed-square
root embedding followed by $\ell^2$ normalization. VLAD has
several standard variants that differ in the post-processing;
here we use the one that $\ell^2$ normalizes the individual VLAD
subvectors (one for each visual word) before $\ell^2$ normalizing
the whole vector i).

^ This extension generalizes to all homogeneous kernels, including
for example 1991.


5 Plan of experiments and highlights

The next several pages contain an extensive set of experimental
results. This section provides a guide to these experiments
and summarizes the main findings.

The goal of the first block of experiments (Sect. 6.1) is
to determine which representations work best on different
problems such as texture attribute, texture material, object,
and scene recognition. The main findings are:

- Orderless pooling of SIFT features (e.g. FV-SIFT) performs
better than specialized texture descriptors in many
texture recognition problems; performance is further improved
by switching from SIFT to CNN local descriptors
(FV-CNN; Sect. 6.1.3).

- Orderless pooling of CNN descriptors using the Fisher
Vector (FV-CNN) is often significantly superior to
fully-connected pooling of the same descriptors (FC-CNN)
in texture, scene, and object recognition (Sect. 6.1.4).
This difference is more marked for deeper CNN architectures
(Sect. 6.1.5) and can be partially explained by
the ability of FV pooling to overfit less and to easily integrate
information at multiple image scales (Sect. 6.1.6).

- FV-CNN descriptors can be compressed to the same dimensionality
as FC-CNN descriptors while preserving
accuracy (Sect. 6.1.7).


Having determined good representations in Sect. 6.1, the
second block of experiments (Sect. 6.2) compares them to
the state of the art in texture, object, and scene recognition.
The main findings are:

- In texture recognition in the wild, for both materials
(FMD) and attributes (DTD), CNN-based descriptors
substantially outperform existing methods. Depending
on the dataset, FV pooling is a little or substantially better
than FC pooling of CNN descriptors (Sect. 6.2.1.4).
When textures are extracted from a larger cluttered scene
(instead of filling the whole image), the difference between
FV and FC pooling increases (Sect. 6.2.1.5).

- In coarse object recognition (PASCAL VOC), fine-grained
object recognition (CUB-200), scene recognition (MIT
Indoor), and recognition of things & stuff (MSRC),
the FV-CNN representation achieves results
that are close and sometimes superior to the state of
the art, while using a simple and fully generic pipeline
(Sect. |6X3]).

- FV-CNN appears to be particularly effective in domain
transfer. Sect. 6.2.3 shows in fact that FV pooling compensates
for the domain gap caused by training a CNN
on two very different domains, namely scene and object
recognition.

Having addressed image classification in Sect. 6.1 and 6.2,
the third block of experiments compares representations
on semantic segmentation. It shows that FV pooling
of CNN descriptors can be combined with a region proposal
generator to obtain high-quality segmentation of materials
in the OpenSurfaces and MSRC data. For example, combined
with a post-processing step using a CRF, FV-VGG-VD
surpasses the state of the art on the latter dataset. It is
also shown that, differently from FV-CNN, FC-CNN is too
slow to be practical in this scenario.


6 Experiments on semantic recognition 

So far the paper has introduced novel problems in texture 
understanding as well as a number of old and new texture 
representations. The goal of this section is to determine, 
through extensive experiments, what representations work 
best for which problem. 

Representations are labelled as pairs X-Y, where X is a 
pooling encoder and Y a local descriptor. For example, FV- 
SIFT denotes the Fisher vector encoder applied to densely 
extracted SIFT descriptors, whereas BoVW-CNN denotes 
the bag-of-visual-words encoder applied on top of CNN 
convolutional descriptors. Note in particular that the CNN- 
based image representations as commonly extracted in the 
literature EOl HtI [84ll implicitly use CNN-based descrip¬ 
tors and the FC pooler, and therefore are denoted here as 
FC-CNN. 


6.1 Local image descriptors and encoders evaluation 


This section compares different local image descriptors and
pooling encoders (Sect. 6.1.1) on selected representative
tasks in texture recognition, object recognition, and scene
recognition (Sect. 6.1.2). In particular, Sect. 6.1.3 compares
different local descriptors, Sect. 6.1.4 different pooling encoders,
and Sect. 6.1.5 additional variants of the CNN-based
descriptors.


6.1.1 General experimental setup 

The experiments are centered around two types of local 
descriptors. The first type are SIFT descriptors extracted 
densely from the image (denoted DSIFT). SIFT descriptors 
are sampled with a step of two pixels and the support of 
the descriptor is scaled such that a SIFT spatial bin has size 
8 × 8 pixels. Since there are 4 × 4 spatial bins, the support
or “receptive field” of each DSIFT descriptor is 40 × 40
pixels (including a border of half a bin due to bilinear interpolation).
Descriptors are 128-dimensional iQi, but their
dimensionality is further reduced to 80 using PCA, in all 
experiments. Besides improving the classification accuracy, 
this significantly reduces the size of the Fisher Vector and 
VLAD encodings. 


The second type of local image descriptors are deep con¬ 
volutional features (denoted CNN) extracted from the con¬ 
volutional layers of CNNs pre-trained on ImageNet ILSVRC 
data. Most experiments build on the VGG-M model of 1^ 
as this network performs better than standard networks such 
as the Caffe reference model HTI and AlexNet lISTIl while 
having a similar computational cost. The VGG-M convolu¬ 
tional features are extracted as the output of the last convolu¬ 
tional layer, directly from the linear filters excluding ReLU 
and max pooling, which yields a field of 512-dimensional 
descriptor vectors. In addition to VGG-M, experiments con¬ 
sider the recent VGG-VD (very deep with 19 layers) model 
of Simonyan and Zisserman lf90l . The receptive field of 
CNN descriptors is much larger compared to SIFT: 139 x 
139 pixels for VGG-M and 252 x 252 for VGG-VD. 

When combined with a pooling encoder, local descriptors
are extracted at multiple scales, obtained by rescaling
the image by factors $2^s$, $s = -3, -2.5, \dots, 1.5$ (but, for efficiency,
discarding scales that would make the image larger
than $1024^2$ pixels).
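The scale selection rule can be written out as a short sketch. The function name is hypothetical, and the size test is interpreted here as a bound on the number of pixels of the rescaled image (its area); this interpretation is an assumption of the example.

```python
def image_scales(width, height, max_pixels=1024 ** 2):
    """Rescaling factors 2^s, s = -3, -2.5, ..., 1.5, kept only if the
    rescaled image has at most max_pixels pixels (assumed area criterion)."""
    exponents = [-3.0 + 0.5 * i for i in range(10)]
    kept = []
    for s in exponents:
        factor = 2.0 ** s
        if (width * factor) * (height * factor) <= max_pixels:
            kept.append(factor)
    return kept


# Example for a hypothetical 640 x 480 image.
scales = image_scales(640, 480)
print(scales)
```

For this image size the three largest factors are discarded, since doubling a 640 × 480 image already exceeds the pixel budget.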

The dimensionality of the final representation strongly 
depends on the encoder type and parameters. For K visual 
words, BoVW and LLC have K dimensions, VLAD has 
KD and FV 2KD, where D is the dimension of the lo¬ 
cal descriptors. For the FC encoder, the dimensionality is 
fixed by the CNN architecture; here the representation is ex¬ 
tracted from the penultimate FC layer (before the final clas¬ 
sification layer) of the CNNs and happens to have 4096 di¬ 
mensions for all the CNNs considered. In practice, dimen¬ 
sions vary widely, with BoVW, LLC, and FC having a com¬ 
parable dimensionality, and VLAD and FV a much higher 
one. For example, FV-CNN has $\approx 64 \cdot 10^3$ dimensions with
$K = 64$ Gaussian mixture components, versus the 4096 of
FC, BoVW, and LLC (when used with $K = 4096$ visual
words). In practice, however, dimensions are hardly comparable
as VLAD and FV vectors are usually highly compressible
1231. We verified this by using PCA to reduce FV to
4096 dimensions and observing only a marginal reduction
in classification performance in the PASCAL VOC object 
recognition task, as described below. 
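The dimensionalities quoted in this paragraph follow directly from the formulas $K$, $KD$ and $2KD$, as the following sketch shows. The helper name `encoder_dim` is hypothetical; the values of $K$ and $D$ are those stated in the text for the CNN descriptors.

```python
def encoder_dim(encoder, K, D):
    """Dimension of the pooled representation for K visual words and D-dimensional descriptors."""
    if encoder in ("BoVW", "LLC"):
        return K
    if encoder == "VLAD":
        return K * D
    if encoder == "FV":
        return 2 * K * D
    raise ValueError("unknown encoder: " + encoder)


# CNN descriptors are 512-dimensional (Sect. 6.1.1).
fv_cnn_dim = encoder_dim("FV", K=64, D=512)
bovw_dim = encoder_dim("BoVW", K=4096, D=512)
print(fv_cnn_dim, bovw_dim)
```

With $K = 64$ and $D = 512$ the FV has 65,536 entries, sixteen times the 4096 dimensions of the FC, BoVW and LLC representations.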

Unless otherwise specified, learning uses a standard non¬ 
linear SVM solver. Initially, cross-validation was used to se¬ 
lect the parameter C of the SVM in the range {0.1,1,10,100}; 
however, after noting that performance was nearly identical 
in this range (probably due to the data normalization), C 
was simply set to the constant 1. Instead, it was found that 
recalibrating the SVM scores for each class improves clas¬ 
sification accuracy (but of course not mAP). Recalibration 
is obtained by changing the SVM bias and rescaling the 
SVM weight vector in such a way that the median scores of 
the negative and positive training samples for each class are 
mapped respectively to the values $-1$ and $1$.

All the experiments in the paper use the VLFeat library
1221 for the computation of SIFT features and the
pooling embedding (BoVW, VLAD, FV). The MatConvNet
1981 library is used instead for all the experiments involving
CNNs. Further details specific to the setup of each
experiment are given below as needed.
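The per-class score recalibration described in this section amounts to an affine map of the SVM scores, which can be folded back into the weight vector and bias. The sketch below illustrates this; the function name and the toy scores are hypothetical, and the code assumes that the two medians differ.

```python
from statistics import median


def recalibrate(weights, bias, pos_scores, neg_scores):
    """Rescale (weights, bias) so that the median training score of the
    negatives maps to -1 and the median training score of the positives to +1."""
    m_pos, m_neg = median(pos_scores), median(neg_scores)
    if m_pos == m_neg:
        raise ValueError("medians coincide; recalibration is undefined")
    a = 2.0 / (m_pos - m_neg)   # slope of the affine map s -> a * s + b
    b = -1.0 - a * m_neg        # offset chosen so that m_neg -> -1
    return [a * w for w in weights], a * bias + b


# Toy example (hypothetical scores of one class on its training samples).
new_weights, new_bias = recalibrate(
    weights=[2.0, -4.0], bias=1.0,
    pos_scores=[2.0, 3.0, 5.0], neg_scores=[-2.0, -1.0, 0.0])
print(new_weights, new_bias)
```

Because the map is increasing and applied per class, it leaves the ranking within each class (and hence mAP) unchanged, while making scores of different classes comparable when the highest-scoring class is selected.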


6.1.2 Datasets and evaluation measures 


The evaluation is performed on a diversity of tasks: the new
describable attribute and material recognition benchmarks 
in DTD and OpenSurfaces, existing ones in FMD and KTH- 
T2b, object recognition in PASCAL VOC 2007, and scene 
recognition in MIT Indoor. All experiments follow standard 
evaluation protocols for each dataset, as detailed below. 

DTD (Sect. 2) contains 47 texture classes, one per visual
attribute, with 120 images each. Images are equally
split into train, test and validation sets. Experiments cover
the prediction of “key attributes” as well as “joint attributes”,
as defined in Sect. 2.1, and report accuracy averaged
over the 10 default splits provided with the dataset.
OpenSurfaces ||8l is used in the setup described in Sect. 3
and contains 25,357 images, out of which we selected 10,422
images, spanning 21 categories. When segments are
provided, the dataset is referred to as OS+R, and recognition
accuracy is reported on a per-segment basis. We also annotated
the segments with the attributes from DTD, and called
this subset OSA (and OSA+R for the setup when segments
are provided). For the recognition task on OSA+R we report
mean average precision, as this is a multi-label dataset.

FMD ll87l consists of 1,000 images with 100 for each 
of ten material categories. The standard evaluation proto¬ 
col of ll87l uses 50 images per class for training and the re¬ 
maining 50 for testing, and reports classification accuracy 
averaged over 14 splits. KTH-T2b ll65l contains 4,752 im¬ 
ages, grouped into 11 material categories. For each material 
category, images of four samples were captured under vari¬ 
ous conditions, resulting in 108 images per sample. Follow¬ 
ing the standard procedure lfT8ll94l . images of one material 
sample are used to train the model, and the other three sam¬ 
ples for evaluating it, resulting in four possible splits of the 
data, for which average per-class classification accuracy is 
reported. MIT Indoor Scenes lf82l contains 6,700 images 
divided in 67 scene categories. There is one split of the data 
into train (80%) and test (20%), provided with the dataset, 
and the evaluation metric is average per-class classification 
accuracy. PASCAL VOC 2007 ll^ contains 9,963 images 
split across 20 object categories. The dataset provides a stan¬ 
dard split in training, validation and test data. Performance is 
reported in terms of mean average precision (mAP) computed
using the TRECVID 11-point interpolation scheme.^


^ The procedure for computing the AP was changed in later versions
of the benchmark.
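For reference, 11-point interpolated average precision can be sketched as follows: precision is interpolated at the recall levels 0, 0.1, ..., 1 by taking the maximum precision achieved at any recall at least as large, and the eleven values are averaged. The function name and the toy ranking are hypothetical; this is an illustration, not the benchmark's official evaluation code.

```python
def average_precision_11pt(scores, labels):
    """11-point interpolated AP; labels are 1 for positives and 0 for negatives."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        precisions.append(tp / rank)
        recalls.append(tp / n_pos)
    ap = 0.0
    for j in range(11):
        t = j / 10
        # Interpolated precision: best precision at any recall >= t.
        candidates = [p for p, r in zip(precisions, recalls) if r >= t]
        ap += max(candidates) if candidates else 0.0
    return ap / 11


# Toy ranking (hypothetical): positive, negative, positive, negative.
ap = average_precision_11pt([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(ap)
```

The mAP reported in the experiments is the mean of this quantity over the object categories.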


                                        Kernel
Local descr.     Linear        Hellinger     add-χ²        exp-χ²
MR8              20.8 ± 0.9    26.2 ± 0.8    29.7 ± 0.9    34.3 ± 1.1
LM               26.7 ± 0.9    34.8 ± 1.2    39.5 ± 1.4    44.0 ± 1.4
Patch 3 × 3      15.9 ± 0.5    24.4 ± 0.7    27.8 ± 0.8    30.9 ± 0.7
Patch 7 × 7      20.7 ± 0.8    30.6 ± 1.0    34.8 ± 1.0    37.9 ± 0.9
LBP^u             8.5 ± 0.4     9.3 ± 0.5    12.5 ± 0.4    19.4 ± 0.7
LBP-VQ           26.2 ± 0.8    28.8 ± 0.9    32.7 ± 1.0    36.1 ± 1.3
SIFT             45.2 ± 1.0    49.1 ± 1.1    50.9 ± 1.0    52.3 ± 1.2
Conv VGG-M       55.9 ± 1.3    61.7 ± 0.9    61.9 ± 1.0    61.2 ± 1.0
Conv VGG-VD      64.1 ± 1.3    68.8 ± 1.3    69.0 ± 0.9    68.8 ± 0.9

Table 2: Comparison of local features and kernels on the
DTD data. The table reports classification accuracy, averaged
over the predefined ten splits provided with the
dataset. The best performing descriptors, SIFT and convolutional
features (marked in bold in the original table), are the ones
covered in the following experiments and discussions.
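The four kernels compared in Table 2 can be sketched in plain Python as follows. The code assumes the standard definitions of the linear, Hellinger, additive-χ² and exponential-χ² kernels on non-negative vectors, together with the kernel normalization of Sect. 4.2.3; the function names and the toy histograms are hypothetical.

```python
import math


def linear(a, b):
    return sum(x * y for x, y in zip(a, b))


def hellinger(a, b):
    """Hellinger's kernel; inputs are assumed non-negative."""
    return sum(math.sqrt(x * y) for x, y in zip(a, b))


def additive_chi2(a, b):
    """Additive chi-squared kernel; terms with x + y = 0 contribute zero."""
    return sum(2.0 * x * y / (x + y) for x, y in zip(a, b) if x + y > 0)


def chi2_distance(a, b):
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def exponential_chi2(a, b, lam):
    """Exponential chi-squared kernel with parameter lam."""
    return math.exp(-lam * chi2_distance(a, b))


def normalized(kernel, a, b):
    """Kernel normalization K(a, b) / sqrt(K(a, a) K(b, b))."""
    return kernel(a, b) / math.sqrt(kernel(a, a) * kernel(b, b))


# Toy L1-normalized histograms (hypothetical values).
a = [0.5, 0.5, 0.0]
b = [0.25, 0.25, 0.5]
print(linear(a, b), hellinger(a, b), additive_chi2(a, b), exponential_chi2(a, b, 1.0))
print(normalized(linear, a, b))
```

For histograms that sum to one, the Hellinger and additive-χ² kernels of a vector with itself equal one, so they are already normalized, whereas the linear kernel is not.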


6.1.3 Local image descriptors and kernels comparison 

The goal of this section is to establish which local image descriptors
work best in a texture representation. The question
is relevant because: (i) while SIFT is the de-facto standard
hand-crafted feature in object and scene recognition, most
authors use specialized descriptors for texture recognition
and (ii) learned convolutional features in CNNs have not yet
been compared when used as local descriptors (instead, they
have been compared to classical image representations when
used in combination with their FC layers).

The experiments are carried out on the task of recognizing
describable texture attributes in DTD (Sect. 2) using the
BoVW encoder. As a byproduct, the experiments determine
the relative difficulty of recognizing the 47 different perceptual
attributes in DTD.


Fig. 5: Per-class classification accuracy in the DTD data comparing three local image descriptors: SIFT, VGG-M, and VGG-VD. For all three local descriptors, BoVW with 4096 visual words was used. Classes are sorted by increasing BoVW-CNN-VD accuracy (this number is reported along each bar).

6.1.3.1 Experimental setup. The following local image descriptors are compared: the linear filter banks of Leung and Malik (LM) [57] (48D descriptors) and MR8 (8D descriptors) [38], [96], the 3 × 3 and 7 × 7 raw image patches of 19^ (respectively 9D and 49D), the local binary patterns (LBP) of [69] (58D), SIFT (128D), and CNN features extracted from VGG-M and VGG-VD (512D).

After the BoVW representation is extracted, it is used to train a 1-vs-all SVM using the different kernels discussed in Sect. 4.2.3: linear, Hellinger, additive-χ², and exponential-χ². Kernels are normalized as described before. The exponential-χ² kernel requires choosing the parameter λ; this is set as the reciprocal of the mean of the χ² distance matrix of the training BoVW vectors. Before computing the exponential-χ² kernel, furthermore, BoVW vectors are L^ normalized. An important parameter in BoVW is the number of visual words selected. K was varied in the range 256, 512, 1024, 2048, 4096 and performance evaluated on a validation set. Regardless of the local feature and embedding, performance was found to increase with K and to saturate around K = 4096 (although the relative benefit of increasing K was larger for features such as SIFT and CNNs). Therefore K was set to this value in all experiments.
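The kernels used in this setup can be sketched in a few lines. The formulas below are the commonly used definitions for histogram data and are assumptions here, since the precise definitions are given elsewhere in the paper; in particular, the χ² distance is taken without a 1/2 factor, the mean is computed over the full distance matrix including the zero diagonal, and the L1 normalization helper is only one possible choice. All function names are hypothetical.

```python
import math

def l1_normalize(h):
    # Assumed normalization for histogram inputs.
    s = sum(abs(v) for v in h)
    return [v / s for v in h] if s > 0 else list(h)

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def hellinger_kernel(x, y):
    # Linear kernel on element-wise square roots of non-negative histograms.
    return sum(math.sqrt(a * b) for a, b in zip(x, y))

def additive_chi2_kernel(x, y):
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)

def chi2_distance(x, y):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)

def exp_chi2_gram(train):
    """Exponential-chi2 Gram matrix with lambda = 1 / mean chi2 distance."""
    n = len(train)
    d = [[chi2_distance(train[i], train[j]) for j in range(n)] for i in range(n)]
    mean_d = sum(sum(row) for row in d) / (n * n)
    lam = 1.0 / mean_d if mean_d > 0 else 1.0
    gram = [[math.exp(-lam * d[i][j]) for j in range(n)] for i in range(n)]
    return gram, lam
```

Setting λ from the training data in this way keeps the exponent on a comparable scale across descriptors whose χ² distances differ widely.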


6.1.3.2 Analysis. Table 2 reports the classification accuracy
for 47 1-vs-all SVM attribute classifiers, computed as 0. 
As often found in the literature, the best kernel was found to be exponential-χ², followed by additive-χ², Hellinger's, and linear kernels. Among the hand-crafted descriptors, dense SIFT significantly outperforms the best specialized texture descriptor on the DTD data (52.3% for BoVW-exp-χ²-SIFT vs 44% for BoVW-exp-χ²-LM). CNN local descriptors handily outperform handcrafted features by a 10-15% recognition accuracy margin. It is also interesting to note that the choice of kernel function has a much stronger effect for image patches and linear filters (e.g. accuracy nearly doubles moving from BoVW-linear-patches to BoVW-exp-χ²-patches) and an almost negligible effect for the much stronger CNN features.

Fig. 5 reports the classification accuracy for each attribute in DTD for the BoVW-SIFT, BoVW-VGG-M, and BoVW-VGG-VD descriptors and the additive-χ² kernel. As may be expected, concepts such as chequered, waffled, knitted, paisley achieve nearly perfect classification, while others such as blotchy, smeared or stained are far harder.


6.1.3.3 Conclusions. The conclusions are that (i) SIFT descriptors significantly outperform texture-specific descriptors such as linear filter banks, patches, and LBP on this texture recognition task, and that (ii) learned convolutional local descriptors significantly surpass SIFT.


6.1.4 Pooling encoders 

The previous section established the primacy of SIFT and CNN local image descriptors over alternatives. The goal of this section is to determine which pooling encoders (Sect. 4.2) work best with these descriptors, comparing the orderless BoVW, LLC, VLAD, FV encoders and the order-sensitive FC encoder. The latter, in particular, reproduces the CNN transfer learning setting commonly found in the literature where CNN features are extracted in correspondence to the FC layers of a network.


6.1.4.1 Experimental setup. The experimental setup is similar to the previous experiment: the same SIFT and CNN VGG-M descriptors are used; BoVW is used in combination with the Hellinger kernel (the exponential variant is slightly better, but much more expensive); the same K = 4096 codebook size is used with LLC. VLAD and FV use a much smaller codebook as these representations multiply the dimensionality of the descriptors (Sect. 6.1.1). Since SIFT and CNN features are respectively 128 and 512-dimensional, K is set to 256 and 64 respectively. The impact of varying the number of visual words in the FV representation is further analyzed in Sect. 6.1.5.

                      SIFT                                    VGG-M
Dataset     meas.(%)  BoVW      LLC       VLAD      IFV       BoVW      LLC       VLAD      IFV       FC
DTD         acc       49.0±0.8  48.2±1.4  54.3±0.8  58.6±1.2  61.2±1.3  64.0±1.3  67.6±0.7  66.8±1.5  58.7±0.9
OS+R        acc       30.0      30.8      32.5      39.8      41.3      45.3      49.7      52.5      41.3
KTH-T2b     acc       57.6±1.5  56.8±2.0  64.3±1.3  70.2±1.6  73.6±2.8  74.0±3.3  72.2±4.7  73.3±4.8  71.0±2.3
FMD         acc       50.5±1.7  48.4±2.2  54.0±1.3  59.7±1.6  67.9±2.2  71.7±2.1  74.2±2.0  73.5±2.0  70.3±1.8
VOC07       mAP11     51.2      47.8      56.9      59.9      72.9      75.5      76.8      76.4      76.8
MIT Indoor  acc       47.7      39.2      51.0      54.9      69.1      68.9      71.2      74.2      62.5

Table 3: Pooling encoder comparisons. The table compares the orderless pooling encoders BoVW, LLC, VLAD, and IFV with either SIFT local descriptors or VGG-M CNN local descriptors (FV-CNN). It also compares pooling convolutional features with the CNN fully connected layers (FC-CNN). The table reports classification accuracies for all datasets except VOC07 and OS+R for which mAP-11 [32] and mAP are reported, respectively.


Before pooling local descriptors with a FV, these are usually de-correlated by using PCA whitening. Here PCA is applied to SIFT, additionally reducing its dimension to 80, as this was empirically shown to improve recognition performance. The effect of PCA reduction on the convolutional features is studied in Section 6.1.7. The improved version
of the FV is used in all the experiments (Sect.j^, and, sim¬ 
ilarly, for VLAD, we applied signed square root to the re¬ 
sulting encoding, which is then normalized component-wise 
(Sect. [4T^ . 
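The post-processing applied to the pooled encodings can be sketched as follows. The improved Fisher vector is commonly described as a signed square root followed by L2 normalization, and that is what the sketch implements for the whole vector; the component-wise normalization used for VLAD is not reproduced, and the function names are hypothetical.

```python
import math

def signed_sqrt(v):
    # sign(x) * sqrt(|x|), applied element-wise.
    return [math.copysign(math.sqrt(abs(x)), x) for x in v]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def improve_encoding(v):
    """Signed square root followed by global L2 normalization."""
    return l2_normalize(signed_sqrt(v))
```

The square root compresses the few very large entries that bursty local descriptors produce, so that a linear classifier is not dominated by them.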


6.1.4.2 Analysis. Results are reported in Table 3. In terms of orderless encoders, BoVW and LLC result in similar performance for SIFT, while the difference is slightly larger and in favor of LLC for CNN features. Note that BoVW is used with the Hellinger kernel, which contributes to reducing the gap between BoVW and LLC. IFV and VLAD significantly outperform BoVW and LLC in almost all tasks; FV is definitely better than VLAD with SIFT features and about the same with CNN features. CNN features maintain a healthy lead on SIFT features regardless of the encoder used. Importantly, VLAD and FV (and to some extent BoVW and LLC) perform either substantially better or as well as the original FC encoders. Some of these observations are confirmed by other experiments such as Table 4.

Next, we compare using CNN features with an orderless encoder (FV-CNN) as opposed to the standard FC layer (FC-CNN). As seen in Table 3 and Table 4, in PASCAL VOC and MIT Indoor the FC-CNN descriptor performs very well but in line with previous results for this class of methods EOl. FV-CNN performs similarly to FC-CNN in PASCAL VOC, KTH-T2b and FMD, but substantially better for DTD, OS+R, and MIT Indoor (e.g. for the latter +5% for VGG-M and +13% for VGG-VD).

As a sanity check, results are within 1% of the ones reported in ifT^ and [20] for matching experiments on FV-SIFT and FC-VGG-M. The differences in case of SIFT LLC and BoVW are easily explained by the fact that, differently from m, our present experiments do not use SPP and image augmentation.

Fig. 7: Effect of the number of Gaussian components in the FV encoder. The figure shows the performance of the FV-VGG-M and FV-VGG-VD representations on the OS and DTD datasets when the number of Gaussian components in the GMM is varied from 1 to 128. Note that the abscissa is scaled logarithmically.


6.1.4.3 Conclusions. The conclusions of these experiments are that (i) IFV and VLAD are preferable to other orderless pooling encoders, and that (ii) orderless pooling encoders such as the FV are at least as good and often significantly better than FC pooling with CNN features.


6.1.5 CNN descriptor variants comparison 

This section conducts additional experiments on CNN local descriptors to find the best variants.

                         SIFT      AlexNet                       VGG-M                         VGG-VD                        FV-SIFT/
     dataset   meas.(%)  FV        FC        FV        FC+FV     FC        FV        FC+FV     FC        FV        FC+FV     FC+FV-VD  SoA
(a)  CUReT     acc       99.0±0.2  94.4±0.4  98.5±0.2  99.0±0.2  94.2±0.3  98.7±0.2  99.1±0.2  94.5±0.4  99.0±0.2  99.2±0.2  99.7±0.1  99.8±0.1 [89]
     UMD       acc       99.1±0.5  95.9±0.9  99.7±0.2  99.7±0.3  97.2±0.9  99.9±0.1  99.8±0.2  97.7±0.7  99.9±0.1  99.9±0.1  99.9±0.1  99.7±0.3 [89]
     UIUC      acc       96.6±0.8  91.1±1.7  99.2±0.4  99.3±0.4  94.5±1.4  99.6±0.4  99.6±0.3  97.0±0.7  99.9±0.1  99.9±0.1  99.9±0.1  99.4±0.4 [89]
     KT        acc       99.5±0.5  95.5±1.3  99.6±0.4  99.8±0.2  96.1±0.9  99.8±0.2  99.9±0.1  97.9±0.9  99.8±0.2  99.9±0.1  100       99.4±0.4 f8^
     ALOT      acc       94.6±0.3  86.0±0.4  96.7±0.3  97.8±0.2  88.7±0.5  97.8±0.2  98.4±0.1  90.6±0.4  98.5±0.1  99.0±0.1  99.3±0.1  95.9±0.5 [92]
(b)  KTH-T2b   acc       70.8±2.7  71.5±1.3  69.7±3.2  72.1±2.8  71±2.3    73.3±4.7  73.9±4.9  75.4±1.5  81.8±2.5  81.1±2.4  81.5±2.0  76.0±2.9 [92]
     FMD       acc       59.8±1.6  64.8±1.8  67.7±1.5  71.4±1.7  70.3±1.8  73.5±2.0  76.6±1.9  77.4±1.8  79.8±1.8  82.4±1.5  82.2±1.4  57.7±1.7 [86]
     OS+R      acc       39.8      36.8      46.1      49.8      41.3      52.5      54.9      43.4      59.5      60.9      58.7      -
(c)  DTD       acc       58.6±1.2  55.1±0.6  62.9±1.4  66.5±1.1  58.8±0.8  66.8±1.6  69.8±1.1  62.9±0.8  72.3±1.0  74.7±1.0  75.5±0.8  -
     DTD       mAP       61.3±1.1  57.7±0.9  66.5±1.4  70.5±1.2  62.1±0.9  70.8±1.2  74.2±1.1  67.0±1.1  76.7±0.8  79.1±0.8  80.4±0.9  -
     DTD-J     mAP       59.6±0.6  58.4±0.7  65.0±0.9  68.3±0.9  62.8±0.7  69.8±0.9  72.9±0.9  67.3±0.9  75.8±0.6  77.5±0.8  78.9±0.7  -
     OSA+R     mAP       56.5      53.9      62.1      64.6      54.3      65.2      67.9      49.7      67.2      67.9      68.2      -
(d)  MSRC+R    acc       85.7      83.6      91.7      94.9      85.0      95.4      96.9      79.4      97.7      98.8      99.1      -
     MSRC+R    msrc-acc  92.0      84.1      95.0      97.3      84.0      97.6      98.1      82.0      99.2      99.6      99.5      -
     VOC07     mAP11     59.9      74.0      73.1      76.8      76.8      76.4      79.5      81.7      84.9      85.1      84.5      85.2 [105]
     VOC07     mAP       60.2      76.0      75.0      79.0      79.2      78.7      82.3      84.6      88.6      88.5      87.9      85.2 [105]
     MIT Ind.  acc       54.9      58.6      69.7      71.6      62.5      74.2      74.4      67.6      81.0      80.3      80.0      70.8 [109]
     CUB       acc       17.5      45.8      49        54.1      46.1      49.9      54.9      54.6      66.7      67.3      65.4      73.9* [107]
     CUB+R     acc       27.7      54.5      62.6      65.2      56.5      65.5      68.1      62.8      73.0      74.9      73.6      76.37 [107]

Table 4: State of the art texture descriptors. The table compares FC-CNN and FV-CNN on three networks trained on ImageNet (VGG-M, VGG-VD and AlexNet), and IFV on dense SIFT. We evaluated these descriptors on (a) texture datasets in controlled settings, (b) material datasets (FMD, KTH-T2b, OS+R), (c) texture attributes (DTD, OSA+R) and (d) general categorisation datasets (MSRC+R, VOC07, MIT Indoor) and fine grained categorisation (CUB, CUB+R). For this experiment the region support is assumed to be known (and equal to the entire image for all the datasets except OS+R and MSRC+R and for CUB+R, where it is set to the bounding box of a bird). *using a model without parts like ours the performance is 62.8%.


6.1.5.1 Experimental setup. The same setup of the previous section is used. We compare the performance of FC-CNN and FV-CNN local descriptors obtained from VGG-M, VGG-VD as well as the simpler AlexNet ifSTll CNN which is widely adopted in the literature.

6.1.5.2 Analysis. Results are presented in detail in Table 4. Within that table, the analysis here focuses mainly on texture and material datasets, but conclusions are similar for the other datasets. In general, VGG-M is better than AlexNet and VGG-VD is substantially better than VGG-M (e.g. on FMD, FC-AlexNet obtains 64.8%, FC-VGG-M obtains 70.3% (+5.5%), FC-VGG-VD obtains 77.4% (+7.1%)). However, switching from FC to FV pooling improves the performance more than switching to a better CNN (e.g. on DTD going from FC-VGG-M to FC-VGG-VD yields a 7.1% improvement, while going from FC-VGG-M to FV-VGG-M yields an 11.3% improvement). Combining FV-CNN and FC-CNN (by stacking the corresponding image representations) improves the accuracy by 1-2% for VGG-VD, and up to 3-5% for VGG-M. There is no significant benefit from adding FV-SIFT as well, as the improvement is at most 1%, and in some cases (MIT, FMD) it degrades the performance.

Next, we analyze in detail the effect of depth on the convolutional features. Fig. 6 reports the accuracy of VGG-M and VGG-VD on several datasets for features extracted at increasing depths. The pooling method is fixed to FV and the number of Gaussian centers K is set such that the overall dimensionality of the descriptor 2KD_k is constant. For both VGG-M and VGG-VD, the improvement with increasing depth is substantial and the best performance is obtained by the deepest features (up to 32% absolute accuracy improvement in VGG-M and up to 48% in VGG-VD). Performance increases at a faster rate up to the third convolutional layer (conv3) and then the rate tapers off somewhat. The performance of the earlier levels in VGG-VD is much worse than the corresponding layers in VGG-M. In fact, the performance of VGG-VD matches the performance of the deepest (fifth) layer in VGG-M in correspondence of conv5_1, which has depth 13.
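Keeping the pooled dimensionality 2KD_k constant across layers amounts to choosing K inversely proportional to the local descriptor dimension. The sketch below illustrates that bookkeeping; the function names are hypothetical, and the layer dimensions used in the usage example are illustrative rather than taken from the figure.

```python
def fv_dimension(num_gaussians, descriptor_dim):
    # A Fisher vector stacks a mean and a variance term per Gaussian: 2 * K * D.
    return 2 * num_gaussians * descriptor_dim

def gaussians_for_budget(target_dim, descriptor_dim):
    # Largest K such that 2 * K * D does not exceed the target dimensionality.
    return max(1, target_dim // (2 * descriptor_dim))

# Budget implied by K = 64 Gaussians on 512-dimensional descriptors.
budget = fv_dimension(64, 512)
for dim in (96, 256, 512):
    print(dim, gaussians_for_budget(budget, dim))
```

With this rule, shallower layers with fewer channels are given proportionally more Gaussian components, so the comparison across depths is not confounded by the size of the final representation.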

Finally, we look at the effect of the number of Gaussian components (visual words) in the FV-CNN representation, testing possible values in the range 1 to 128 in small (1-16) increments. Results are presented in Fig. 7. While there is a substantial improvement in moving from one Gaussian component to about 64 (up to +15% on DTD and up to 6% on OS), there is little if any advantage in increasing the number of components further.

6.1.5.3 Conclusions. The conclusions of these experiments are as follows: (i) deeper models substantially improve performance; (ii) switching from FC to FV pooling has an even more substantial impact, particularly for deeper models; (iii) combining FC and FV pooling has a modest benefit and there is no benefit in integrating SIFT features; (iv) in very deep models, most of the performance gain is realized in the very last few layers.

Fig. 6: Effect of the depth on CNN features. The figure reports the performance of VGG-M (left) and VGG-VD (right) local image descriptors pooled with the FV encoder. For each layer the figure shows the size of the receptive field of the local descriptors (denoted [N × N]), as well as, for some of the layers, the dimension D of the local descriptors and the number K of visual words in the FV representation (denoted as D × K). Curves for PASCAL VOC, MIT Indoor, FMD, and DTD are reported; the performance of using SIFT as local descriptors is reported as a plus (+) mark.

6.1.6 FV pooling vs FC pooling 

In the previous section, we have seen that switching from 
FC to FV pooling may have a substantial impact in certain 
problems. We could find three reasons that can explain this 
difference. 

The first reason is that orderless pooling in FV can be more suitable for texture modeling than the order-sensitive pooling in FC. However, this explains the advantage of FV in texture recognition but not in object recognition.

The second reason is that FV pooling may reduce overfitting in domain transfer. Pre-trained FC layers could be too specialized for the source domain (e.g. ImageNet ILSVRC) and there may not be enough training data in the target domain to retrain them properly. On the contrary, a linear classifier built on FV pooling is less prone to overfitting as it encodes a simpler, smoother classification function
than a sequence of FC layers in a CNN. This is further in¬ 
vestigated in Sect. |6.2.^ 

The third reason is the ability to easily incorporate information from multiple image scales.

In order to investigate this hypothesis, we evaluated FV-CNN by pooling CNN descriptors at a single scale instead of multiple ones, for both VGG-M and VGG-VD models. For datasets like FMD, DTD and MIT Indoor, FV-CNN at a single scale still generally outperforms FC-CNN (columns FC (SS) and FV (SS) in Table 5), by up to 5.6% for VGG-M, and by up to 9.1% for VGG-VD; however, the difference is less marked, as using a single scale in FV-CNN loses up to 3.8% accuracy points and in some cases the representation is overtaken by FC-CNN.

The complementary experiment, namely using multiple scales in FC pooling, is less obvious as, by construction, FC-CNN resizes the input image to a fixed resolution. However, we can relax this restriction by computing multiple FC representations in a sliding-window manner (also known as a "fully-convolutional" network). Then individual representations computed at multiple locations and, after resizing the image, at multiple scales can be averaged in a single representation vector. We refer to this as multi-scale FC pooling. Multi-scale FC codes perform slightly better than single-scale FC in most (but not all) cases; however, the benefit of using multiple scales is not as large as for multi-scale FV pooling, which is still significantly better than multi-scale FC.
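Multi-scale FC pooling as described above reduces to averaging FC codes gathered over sliding windows and rescaled copies of the image. The sketch below shows only that averaging step; `resize` and `fc_windows` are hypothetical callables standing in for the image resizer and the fully-convolutional network, and weighting every window equally is an assumption.

```python
def average_pool(codes):
    """Average a non-empty list of equal-length vectors into one vector."""
    n = len(codes)
    dim = len(codes[0])
    return [sum(c[i] for c in codes) / n for i in range(dim)]

def multiscale_fc(image, scales, resize, fc_windows):
    """Collect FC codes over all windows and scales, then average them.

    resize(image, s) returns the image rescaled by factor s (hypothetical).
    fc_windows(image) returns one FC code per sliding window (hypothetical).
    """
    codes = []
    for s in scales:
        codes.extend(fc_windows(resize(image, s)))
    return average_pool(codes)
```

Because larger scales yield more windows, equal weighting implicitly favours the finer scales; averaging per scale first would be an alternative design.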

6.1.7 Dimensionality reduction of the CNN descriptors 

This section explores the effect of applying dimensionality 
reduction to the CNN local descriptors before FV pooling. 

This experiment investigates the effect of two parameters, the number of Gaussians in the mixture model used by the FV encoder, and the dimensionality of the convolutional features, which we reduce using PCA. Various local descriptor dimensions are evaluated, from 512 (no PCA) to 32, reporting mAP on PASCAL VOC 2007, as a function of the pooled descriptor dimension. The latter is equal to 2KD, where K is the number of Gaussian centers, and D the dimensionality of the local descriptor after PCA reduction.

                    VGG-M                                   VGG-VD
dataset   meas.(%)  FC (SS)   FC (MS)   FV (SS)   FV (MS)   FC (SS)   FC (MS)   FV (SS)   FV (MS)
KTH-T2b   acc       71±2.3    68.9±3.9  69.0±2.s  73.3±4.7  75.4±1.5  75.1±3.8  74.5±4.4  81.8±2.5
FMD       acc       70.3±1.8  69.3±1.8  71.6±2.4  73.5±2.0  77.4±1.8  78.1±1.7  79.4±2.5  79.8±1.8
DTD       acc       58.8±0.8  59.9±1.1  62.8±1.5  66.8±1.6  62.9±0.8  65.3±1.5  69.2±0.8  72.3±1.0
VOC07     mAP11     76.8      78        74.8      76.4      81.7      83.2      84.7      84.9
MIT Ind.  acc       62.5      66.1      68.1      74.2      67.6      75.3      76.8      81.0

Table 5: The table compares the single and multi-scale variants of FC-CNN and FV-CNN using two deep CNNs, VGG-M and VGG-VD, trained on the ImageNet ILSVRC, and a number of representative target datasets. The single scale variants are denoted FC (SS) and FV (SS) and the multi-scale variants as FC (MS) and FV (MS).



Fig. 8: PCA reduced FV-CNN. The figure reports the performance of VGG-M (left) and VGG-VD (right) local descriptors, on PASCAL VOC 2007, when reducing their dimensionality from 512 down to 32 using PCA in combination with a variable number of GMM components. The horizontal axis reports the total descriptor dimensionality, proportional to the dimensionality of the local descriptors times the number of GMM components.


Results are presented in Fig. 8 for VGG-M and VGG-VD. It can be noted that, for similar values of the total representation dimensionality 2KD, the performance of PCA-reduced descriptors is a little better than not using PCA, provided that this is compensated by a large number of GMM components. In particular, similar to what was observed for SIFT in iTT], using PCA does improve the performance by 1-2% mAP points; furthermore, reducing descriptors to 64 or 80 dimensions appears to result in the best performance.

6.1.8 Visualization of descriptors 

In this experiment we are interested in understanding which 
GMM components in the FV-CNN representation code for 
a particular concept, as well as in determining which areas 
of the input image contribute the most to the classification 
score. 

In order to do so, let w be the weight vector learned by a SVM classifier for a target class using the FV-CNN representation as input. We partition w in subvectors w_k, one for each GMM component k, and rank components by decreasing value ||w_k||, matching the intuition that the GMM component that is predictive of the target class will result in larger weights. Having identified the top components for a target concept, the CNN local descriptors are then extracted from a test image, the descriptors that are assigned to a top component are selected, and their location is marked on the image. To simplify the visualization, features are extracted at a single scale.
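The ranking step described above can be sketched as follows. The sketch assumes that the weight vector is laid out as K contiguous blocks, one per GMM component, which depends on the FV implementation; the function names and the hard assignment of descriptors to components are likewise assumptions made for illustration.

```python
import math

def rank_components(w, num_components):
    """Rank GMM components by the norm of their block of SVM weights."""
    block = len(w) // num_components  # assumed contiguous layout
    norms = []
    for k in range(num_components):
        sub = w[k * block:(k + 1) * block]
        norms.append(math.sqrt(sum(x * x for x in sub)))
    order = sorted(range(num_components), key=lambda k: -norms[k])
    return order, norms

def locations_for_components(assignments, locations, top_components):
    """Keep the locations of descriptors assigned to a top component."""
    top = set(top_components)
    return [loc for a, loc in zip(assignments, locations) if a in top]
```

For example, `rank_components(w, 64)[0][:3]` would give the three components to mark on the image.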

As can be noted in Fig. 9 for some indicative texture types in DTD, the strongest GMM components do tend to fire in correspondence to the characteristic features of each texture. Hence, we conclude that GMM components, while trained in an unsupervised manner, contain clusters that can consistently localize features that capture distinctive characteristics of different texture types.


6.2 Evaluating texture representations on different domains 


The previous section established optimal combinations of local image descriptors and pooling encoders in texture representations. This section investigates the applicability of these representations to a variety of domains, from texture (Sect. 6.2.1) to object and scene recognition (Sect. 6.2.3). It also emphasizes several practical advantages of orderless pooling compared to fully-connected pooling, including helping with the problem of domain shift in learned descriptors. This section focuses on problems where the goal is to either classify an image as a whole or a known region of an image, while texture segmentation is looked at later in Sect. EH

Fig. 9: FV-CNN descriptor visualization. First three rows: Each image shows the location of the CNN local descriptors that map to the FV-CNN components most strongly associated with the "wrinkled", "studded", "swirly", "bubbly", and "sprinkled" classes for a number of example images in DTD. Red, green and black marks correspond to the top three components selected as described in the text. Last row: Each image was obtained by combining two images, e.g. swirly and wrinkled, and we marked the CNN local descriptors associated with the first class. Swirly descriptors do not fire on the selected wrinkled images. The last pair, studded and bubbly, is harder, as the two images are visually similar, and the descriptors corresponding to studded appear on the bubbly image as well. In order to improve visibility, in these images, we show only the most discriminative FV component.


6.2.1 Texture recognition 


Experiments on textures are divided into recognition in controlled conditions (Sect. 6.2.1.3), where the main sources of variability are viewpoint and illumination, recognition in the wild (Sect. 6.2.1.4), characterized by larger intra-class variations, and recognition in the wild and clutter (Sect. 6.2.1.5), where textures are a small portion of a larger scene.


6.2.1.1 Datasets and evaluation measures. In addition to the datasets evaluated in Sect. 6.1 (DTD, OS+R, FMD and KTH-T2b), we consider here also the standard benchmarks for texture recognition. CUReT lIZTl (5612 images, 61 classes), UIUC El (1000 images, 25 classes), KTH-TIPS HT) (810 images, 10 classes) are collected in controlled conditions, by photographing the same instance of a material, under varying scale, viewing angle and illumination. UMD [106] consists of 1000 images, spread across 25 classes, but was collected in uncontrolled conditions. For these datasets, we follow the standard evaluation procedures, that is, we use half of the images for training and the remaining half for testing, and we report accuracy averaged over 10 random splits. The ALOT dataset flTl is similar to the existing texture datasets, but significantly larger, having 250 categories. For our experiments we used the protocol of [92], using 20 images per class for training and the rest for testing.
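The half/half protocol with 10 random splits can be sketched as below. Splitting within each class and summarizing the per-split accuracies by their mean and standard deviation are assumptions about the protocol, and the function names are hypothetical.

```python
import random

def random_half_splits(labels, num_splits=10, seed=0):
    """Return (train_indices, test_indices) pairs, halving each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    splits = []
    for _ in range(num_splits):
        train, test = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            train.extend(shuffled[:half])
            test.extend(shuffled[half:])
        splits.append((sorted(train), sorted(test)))
    return splits

def mean_and_std(values):
    """Mean and (population) standard deviation of per-split accuracies."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5
```

A classifier would be trained and evaluated once per split, and `mean_and_std` applied to the resulting accuracies.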

6.2.1.2 Experimental setup. For the recognition tasks described in the following subsections, we compare SIFT, VGG-M, and VGG-VD local descriptors and the FC and FV pooling encoders as these were determined before to be some of the best representative texture descriptors. Combinations of such descriptors are evaluated as well.

6.2.1.3 Texture recognition in controlled conditions. This paragraph evaluates texture representations on datasets which are collected under controlled conditions (Table 4, section a).

For instance recognition, CUReT, UMD, UIUC are saturated by modern techniques such as [88], [89], [92], with accuracies above 99%. There is little difference between methods, and FV-SIFT, FV-CNN, and FC-CNN behave similarly. KT is also saturated, although FC-CNN loses about 3% accuracy compared to FV-CNN.

In material recognition, KTH-T2b and ALOT offer a somewhat more interesting challenge. First, there is a significant difference between FC-CNN and FV-CNN (3-6% absolute difference in KTH-T2b and 8-10% in ALOT), consistent across all CNNs evaluated. Second, CNN descriptors are significantly better than SIFT on KTH-T2b and ALOT with absolute accuracy gains of up to 11%.

Compared to the state of the art, FV-SIFT is generally very competitive. In KTH-T2b, FV-SIFT outperforms all recent methods [23] with the exception of [92], which is based on a variant of LBP. The latter is very strong in ALOT too, but in this case FV-SIFT is virtually as good. In the case of KTH-T2b, [92] is better than most of the deep descriptors as well, but it is still significantly bested by FV-VGG-VD (+5.5%). Nevertheless, this is an example in which a specialized texture descriptor can be competitive with deep features, although of course deep features apply unchanged to several other problems.

On ALOT, FV-CNN with VGG-VD is on par with the result obtained by ii (98.45%), but their model was trained with 30 images per class instead of 20. The same paper reports even better results, but when training with 50 images per class or by integrating additional synthetic training data.

6.2.1.4 Texture recognition in the wild. This paragraph evaluates the texture representations on two texture datasets collected "in the wild": FMD (materials) and DTD (describable attributes).

Texture recognition in the wild is more comparable, in terms of the type of intra-class variations, to object recognition than to texture recognition in controlled conditions. Hence, one can expect larger gains in moving from texture-specific descriptors to general-purpose descriptors. This is confirmed by the results. SIFT is competitive with AlexNet and VGG-M features in FMD (within 3% accuracy), but it is significantly worse in DTD (+4.3% for FV-AlexNet and +8.2% for FV-VGG-M). FV-CNN is a little better than FC-CNN (~3%) on FMD and substantially better in DTD (~8%). Different CNN architectures exhibit very different performance; moving from AlexNet to VGG-VD, the absolute accuracy improvement is more than 11% across the board.


Compared to the state of the art, FV-SIFT is generally very competitive, outperforming the specialized texture descriptors developed by lISTlIShl in FMD (and this without using ground-truth texture segmentations as used by il). Yet FV-VGG-VD is significantly better than all these descriptors (+24.1%).

In terms of complementarity of the features, the combination of FC-CNN and FV-CNN improves performance by about 3% across the board, but including FV-SIFT (labelled FV-SIFT/FC+FV-VD in the table) as well does not seem to improve performance further. This is in contrast with the fact that SIFT was found to be fairly complementary to FC-CNN on a variant of AlexNet in 12^ .

6.2.1.5 Texture recognition in clutter. This section evaluates texture representations on recognizing texture materials and describable attributes in clutter. Since there is no standard benchmark for this setting, we introduce here the first analysis of this kind using the OS+R and OSA+R datasets of Sect. 3.1. Recall that the +R suffix indicates that, while textures are imaged in clutter, the classifier is given the ground-truth region segmentation; therefore, the goal of this experiment is to evaluate the effect of realistic viewing conditions on texture recognition, but the problem of segmenting the textures is evaluated later, in Sect. 7.3.

Results are reported in Table 4, sections b and c. As before, performance improves with the depth of CNNs. For example, in material recognition (OS+R) accuracy starts at about 39.1% for FV-SIFT, is about the same for FC-VGG-M (41.3%) and a little better for FC-VGG-VD (43.4%). However, the benefit of switching from FC encoding to FV encoding is now even more dramatic. For example, on OS+R FV-VGG-M has accuracy 52.5% (+11.2%) while FV-VGG-VD has 59.5% (+16.1%). This clearly demonstrates the advantage of orderless pooling of CNN local descriptors over FC pooling when regions of different sizes and shapes must be evaluated. There is also a significant computational advantage (evaluated further in Sect. 6.2.3) if, as is typical, several regions must be classified: in that case, CNN features need not be recomputed for each region. Results on OSA+R are entirely analogous.

6.2.2 Object and scene recognition 

This section evaluates texture descriptors on tasks other 
than texture recognition, namely coarse and fine-grained ob¬ 
ject categorization, scene recognition, and semantic region 
recognition. 

6.2.2.1 Datasets and evaluation measures. In addition to the datasets seen before, here we experiment with fine-grained recognition in the CUB [100] dataset. This dataset contains









           Accuracy (%)
CNN        FC-CNN   FV-CNN   FC+FV-CNN
PLACES     65.0     67.6     73.1
CAFFE      58.6     69.7     71.6
VGG-M      62.5     74.2     74.4
VGG-VD     67.6     81.0     80.3

Table 6: Accuracy of various CNNs on the MIT Indoor dataset. PLACES and CAFFE are the same CNN architecture ("AlexNet") but trained on different datasets (PLACES and ImageNet resp.). The domain-specific advantage of training on PLACES disappears when the convolutional features are used with FV pooling. For all architectures FV-CNN outperforms FC and better architectures lead to better overall performance.


well as a CNN fine-tuned on the CUB data; by contrast, FV-CNN and FC-CNN are used here as global image descriptors which, furthermore, are the same for all the datasets considered. Compared to the results of [107] without part-based descriptors (but still using a part-based object detector), the best of our global image descriptors perform substantially better (62.1% vs 67.3%).

Results on MSRC+R for semantic segmentation are entirely analogous; it is worth noting that, although ground-truth segments are used in this experiment and hence this number is not comparable with others reported in the literature, the best model achieves an outstanding 99.1% per-pixel classification rate in this dataset.


11788 images, representing 200 species of birds. The im¬ 
ages are split approximately into half for training and half 
for testing, according to the list that accompanies the dataset. 
Image representations are either applied to the whole image 
(denoted CUB) or on the region counting the target bird us¬ 
ing ground-truth bounding boxes (CUBh-R). Performance 
in CUB and CUBh-R is reported as per-image classification 
accuracy. For this dataset, the local descriptors are again ex¬ 
tracted at multiple scales, but now only for the smaller range 
{0.5,0.75,1} which was found to work better for this task. 

Performance is also evaluated on the MSRC dataset, 
designed to benchmark semantic segmentation algorithms. 
The dataset contains 591 images, for which some pixels are 
labelled with one of the 23 classes. In order to be consistent 
with the results reported in the literature, performance is re¬ 
ported in term of per-pixel classification accuracy, similar 
to the measure used for the OS task as defined in Sect. 13.11 
However, this measure is further modified such that it is not 
normalized per class: 


acc-msrc(c) 


Up : c(p) = c(p)}| 
IIp : c(p) f 0}| 


(3) 


6.2.2.2 Analysis. Results are reported in Table section d. 
On PASCAL VOC, MIT Indoor, CUB, and CUBh-R the rel¬ 
ative performance of the different descriptors is similar to 
what has been observed above for textures. Compared to the 
state-of-the-art results in each dataset, FC-CNN and particu¬ 
larly the FV-CNN descriptors are very competitive. The best 
result obtained in PASCAL VOC is comparable to the cur¬ 
rent state-of-the-art set by the deep learning method of 11051 
(85.2% vs 84.9% mAP), but using a much more straightfor¬ 
ward pipeline. In MIT Places the best performance is also 
substantially superior (-1-10%) to the current state-of-the-art 
using deep convolutional networks learned on the MIT Place 
dataset 11091 (this is discussed further below). In the CUB 
dataset, the best performance is short (~ 6%) of the state- 
of-the-art results of 11071 . However, II 1 071 uses a category- 
specific part detector and corresponding part descriptor as 


6.2.2.3 Conclusions. The conclusion of this section is that 
FV-CNN, although inspired by texture representations, are 
superior to many alternative descriptors in object and scene 
recognition, including more elaborate constructions. Fur¬ 
thermore, FV-CNN is significantly superior to FC-CNN in 
this case as well. 
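
As a concrete reading of the acc-msrc measure of Eq. (3) in Sect. 6.2.2.1, the following minimal sketch counts correctly predicted pixels among the labelled ones, without any per-class normalization. It assumes that label 0 marks unlabelled pixels and that predictions never use label 0; the function name and the nested-list representation of label maps are illustrative choices, not part of the original evaluation code.

```python
def acc_msrc(truth, pred):
    """Per-pixel accuracy over labelled pixels, not normalized per class.

    truth, pred: 2-D lists of integer labels with identical shapes.
    Assumption: label 0 in `truth` means "unlabelled" and is ignored.
    """
    correct = 0
    labelled = 0
    for truth_row, pred_row in zip(truth, pred):
        for c, c_hat in zip(truth_row, pred_row):
            if c == 0:
                continue  # unlabelled pixel: excluded from the measure
            labelled += 1
            if c == c_hat:
                correct += 1
    if labelled == 0:
        raise ValueError("ground truth contains no labelled pixels")
    return correct / labelled


# Toy label maps (made-up values): 5 labelled pixels, 3 predicted correctly.
truth = [[1, 1, 0],
         [2, 2, 2]]
pred = [[1, 3, 3],
        [2, 2, 1]]
print(acc_msrc(truth, pred))
```

Because every labelled pixel carries the same weight, frequent classes dominate this measure, which is the intended difference from a per-class normalized accuracy.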

6.2.3 Domain transfer 

This section investigates in more detail the problem of
domain transfer in CNN-based features. So far, the same
underlying CNN features, trained on ImageNet's ILSVRC data,
were used in all cases. To investigate the effect of the
source domain on performance, this section considers, in
addition to these networks, new ones trained on the PLACES
dataset [109] to recognize scenes on a dataset of about
2.5 million labeled images. [109] showed that, applied to the
task of scene recognition in MIT Indoor, these features
outperform similar ones trained on ILSVRC (denoted CAFFE [?]
below), a fact explained by the similarity of domains. We
repeat this experiment using FC- and FV-CNN descriptors on top
of VGG-M, VGG-VD, PLACES, and CAFFE.

Results are shown in Table 6. The FC-CNN performance is in
line with that reported in [109]: in scene recognition with
FC-CNN, the same CNN architecture performs better if trained
on the Places dataset instead of the ImageNet data (58.6% vs
65.0% accuracy). ([109] report 68.3% for PLACES applied to MIT
Indoor, a small difference explained by implementation details
such as the fact that, for all the methods, we do not perform
data augmentation by jittering.) Nevertheless, stronger CNN
architectures such as VGG-M and VGG-VD can approach and
outperform PLACES even if trained on ImageNet data (65.0% vs
62.5%/67.6%).

However, when it comes to using the filter banks with FV-CNN,
conclusions are very different. First, FV-CNN outperforms
FC-CNN in all cases, with substantial gains of up to ~11-12%
for the domain transfer from ImageNet to MIT Indoor. The gap
between FC-CNN and FV-CNN is the highest for VGG-VD models
(67.6% vs 81.0%, nearly 14% difference), a trend also
exhibited by other datasets as seen in Table ??. Second, the
advantage of using domain-specific CNNs disappears. In fact,
the same CAFFE model that is 6.4% worse than PLACES with
FC-CNN is actually 2.1% better when used in FV-CNN. The
conclusion is that FV-CNN appears to be immune, or at least
substantially less sensitive, to domain shifts.
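
The differences quoted above can be re-derived from the FC-CNN and FV-CNN columns of Table 6. The short script below is only a worked check of that arithmetic; the dictionary layout and helper names are ours.

```python
# Accuracy (%) on MIT Indoor, copied from Table 6: (FC-CNN, FV-CNN).
table6 = {
    "PLACES": (65.0, 67.6),
    "CAFFE": (58.6, 69.7),
    "VGG-M": (62.5, 74.2),
    "VGG-VD": (67.6, 81.0),
}


def fv_gain(network):
    """Accuracy gained by replacing FC pooling with FV pooling."""
    fc, fv = table6[network]
    return round(fv - fc, 1)


def places_advantage(column):
    """PLACES minus CAFFE for column 0 (FC-CNN) or column 1 (FV-CNN)."""
    return round(table6["PLACES"][column] - table6["CAFFE"][column], 1)


for name in table6:
    print(f"{name:7s} FV-CNN gain over FC-CNN: {fv_gain(name):+.1f}")
print("PLACES vs CAFFE with FC-CNN:", places_advantage(0))
print("PLACES vs CAFFE with FV-CNN:", places_advantage(1))
```

The gain is smallest for the network trained on the target-like domain (PLACES) and largest for VGG-VD, and the sign of the PLACES advantage flips between the two columns, which is the observation the paragraph above relies on.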

Our explanation of this phenomenon is that the convolutional
features are substantially less committed to a specific
dataset than the fully connected layers. Hence, by using
those, FV-CNN tends to be a lot more general than FC-CNN. A
second explanation is that PLACES CNN may learn filters that
tend to capture the overall spatial structure of the image,
whereas CNNs trained on ImageNet tend to focus on localized
attributes which may work well with orderless pooling.

Finally, we compare FV-CNN to alternative CNN pooling
techniques in the literature. The closest method is the one of
[40], which uses a similar underlying CNN to extract local
image descriptors and VLAD instead of FV for pooling. Notably,
however, FV-CNN results on MIT Indoor are markedly better than
theirs for both VGG-M and VGG-VD (68.8% vs 74.2% / 81.0%
resp.) and marginally better (69.7%, Table 4) when the same
CAFFE CNN is used. Also, when using VLAD instead of FV for
pooling the convolutional layer descriptors, the performance
of our method is still better (68.8% vs 71.2%), as seen in
Table ??. The key difference is that FV-CNN pools
convolutional features, whereas [40] pools fully connected
descriptors extracted from square image patches. Thus, even
without spatial information as used by [40], FV-CNN is not
only substantially faster (an 8.5x speedup when using the same
network and three scales), but also at least as accurate.

7 Experiments on semantic segmentation 

The previous sections considered the problem of recognizing
given image regions. This section explores instead the problem
of automatically recognizing as well as segmenting such
regions in the image.

7.1 Experimental setup 

Inspired by Cimpoi et al. [?], who successfully ported object
description methods to texture descriptors, here we propose a
segmentation technique building on ideas from object
detection. An increasingly popular method for object
detection, followed for example by FC-CNN [?], is to first
propose a number of candidate object regions using low-level
image cues, and then verify a shortlist of such regions using
a powerful classifier. Applied to textures, this requires a
low-level mechanism to generate textured region proposals,
followed by a region classifier. A key advantage of this
approach is that it allows applying object-like (FC-CNN) and
texture-like (FV-CNN) descriptors alike. After proposal
classification, each pixel can be assigned more than one
label; this is solved with a simple voting scheme, also
inspired by object detection methods.

The paper explores two such region generation methods: the
crisp regions of [?] and the Multi-scale Combinatorial
Grouping (MCG) of [5]. In both cases, region proposals are
generated using low-level image cues, such as color or texture
consistency, as specified by the original methods. It would of
course be possible to incorporate FC-CNN and FV-CNN among
these energy terms to potentially strengthen the region
generation mechanism itself. However, this partially
contradicts the logic of the scheme, which breaks down the
problem into cheaply generating tentative segmentations and
then verifying them using a more powerful (and likely
expensive) model. Furthermore, and more importantly, these
cues focus on separating texture instances, as presented in
each particular image, whereas FC-CNN and FV-CNN are meant to
identify a texture class. It is reasonable to expect
instance-specific cues (say the color of a painted wall) to be
better for segmentation.

The crisp region method generates a single partition of the
image; hence, individual pixels are labelled by transferring
the label of the corresponding region, as determined by the
learned predictor. By contrast, MCG generates many thousands
of overlapping region proposals in an image and requires a
mechanism to resolve potentially ambiguous pixel labelings.
This is done using the following simple scheme. For each
proposed region, its label is set to the highest scoring class
based on the multi-class SVM, and its score to the
corresponding class score divided by the region area.
Proposals are then sorted by increasing score and "pasted" to
the image sequentially. This has the effect of considering
larger regions before smaller ones and, for regions of the
same area, more confident regions after less confident ones.
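
The pasting scheme for overlapping proposals can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: proposals are given as explicit pixel lists, class scores are positive, and the function name and data layout are hypothetical.

```python
def paste_proposals(height, width, proposals):
    """Label pixels from overlapping region proposals.

    proposals: list of (pixels, class_scores) pairs, where `pixels` is a
    non-empty list of (row, col) tuples and `class_scores` maps a class
    name to its SVM score (assumed positive in this sketch).
    Returns a 2-D list of labels (None where no proposal covers a pixel).
    """
    ranked = []
    for pixels, class_scores in proposals:
        label = max(class_scores, key=class_scores.get)  # top-scoring class
        score = class_scores[label] / len(pixels)        # divide by area
        ranked.append((score, label, pixels))
    ranked.sort(key=lambda item: item[0])                # increasing score

    labels = [[None] * width for _ in range(height)]
    for _, label, pixels in ranked:
        for r, c in pixels:
            labels[r][c] = label  # later (higher-scoring) proposals overwrite
    return labels


# Toy example: a large "wood" region and a small, confident "brick" region.
whole_image = [(r, c) for r in range(2) for c in range(3)]
right_column = [(0, 2), (1, 2)]
proposals = [
    (right_column, {"brick": 0.9, "wood": 0.1}),  # score 0.9 / 2 = 0.45
    (whole_image, {"wood": 1.2, "brick": 0.3}),   # score 1.2 / 6 = 0.20
]
segmentation = paste_proposals(2, 3, proposals)
for row in segmentation:
    print(row)
```

Dividing by the area pushes large regions to the front of the sorted list, so they are pasted first and act as a background that smaller regions later refine.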

7.2 Dense-CRF post-processing 

The segmentation results delivered by the previous methods
can potentially be hampered by the occasional failures of the
respective front-end superpixel segmentation modules. But we
can see the front-end segmentation as providing a convenient
way of pooling discriminative information, which can then be
refined post-hoc through a pixel-level segmentation algorithm.

In particular, a series of recent works [?, 23, 108] have
reported that substantial gains can be obtained by combining
CNN classification scores with the densely-connected
Conditional Random Field (Dense-CRF) of [50]. Apart from its
ability to incorporate information pertaining to image
boundaries and color similarity, the Dense-CRF is particularly
efficient when used in conjunction with approximate
probabilistic inference; the message passing updates under a
fully decomposable mean field approximation can be expressed
as convolutions with a Gaussian kernel in feature space,
implemented efficiently using high-dimensional filtering [?].

Inspired by these advances, we have employed the Dense-CRF
segmentation algorithm post-hoc, with the aim of enhancing our
algorithm's ability to localize region boundaries by taking
context and low-level image information into account. For this
we turn the superpixel classification scores into pixel-level
unary terms, interpreting the SVM classifier's scores as
indicating the negative energy associated to labelling each
pixel with the respective labels. Even though Platt scaling
could be used to turn the SVM scores into log-probability
estimates, we prefer to estimate the transformation by jointly
cross-validating the SVM-Dense-CRF cascade's parameters. In
particular, similarly to [?, 50], we set the dense CRF
hyperparameters by cross-validation, performing grid search to
find the values that perform best on a validation set.
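
The two ingredients of this step, unary terms obtained by negating the region-level SVM scores and a grid search over hyperparameters, can be sketched as below. This is a simplified illustration that omits the Dense-CRF inference itself. The multiplicative `scale` stands in for the cross-validated transformation of the scores, and the parameter names `scale` and `pairwise_weight`, as well as the `evaluate` callback, are hypothetical.

```python
from itertools import product


def unary_energies(region_map, region_scores, scale=1.0):
    """energy[r][c][k] = -scale * (SVM score of class k for the region at (r, c)).

    region_map: 2-D list of region ids, one per pixel.
    region_scores: dict mapping region id -> list of per-class SVM scores.
    """
    return [[[-scale * s for s in region_scores[region_id]]
             for region_id in row]
            for row in region_map]


def grid_search(evaluate, grid):
    """Return (best_params, best_value) over the Cartesian product of `grid`.

    evaluate: callable mapping a parameter dict to validation accuracy
    (a placeholder for running the full cascade on a validation set).
    """
    names = sorted(grid)
    best_params, best_value = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = evaluate(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


# Two regions (top and bottom row), two classes, made-up scores.
region_map = [[0, 0], [1, 1]]
region_scores = {0: [2.0, -1.0], 1: [-0.5, 0.5]}
energies = unary_energies(region_map, region_scores, scale=0.5)
print(energies[0][0], energies[1][0])


def toy_accuracy(params):
    # Stand-in objective that peaks at scale=1.0 and pairwise_weight=3.
    return -(params["scale"] - 1.0) ** 2 - (params["pairwise_weight"] - 3) ** 2


grid = {"scale": [0.5, 1.0, 2.0], "pairwise_weight": [1, 3, 5]}
best_params, best_value = grid_search(toy_accuracy, grid)
print(best_params)
```

A higher SVM score yields a lower energy, so in the absence of pairwise terms the minimum-energy labelling coincides with the region-level prediction.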


7.3 Analysis 

Results are reported in Table 7. Two datasets are evaluated:
OS for material recognition and MSRC for things & stuff.
Compared to OS+R, classifying crisp regions results in a drop
of about 10% per-pixel classification accuracy for all
descriptors. At the same time, it shows that there is ample
space for future improvements. In MSRC, the best accuracy is
87.0%, marginally above the best published result of 86.5%
[53]. Remarkably, these algorithms do not use any
dataset-specific training, nor CRF-regularised semantic
inference: they simply greedily classify regions as obtained
from a general-purpose segmentation algorithm. CRF
post-processing improves the results even further, up to 90.2%
in MSRC. Qualitative segmentation results (sampled at random)
are given in Figs. 10 and 11.

Results using MCG proposals with FV-CNN are shown in Table 7
in brackets (due to the requirement of computing CNN features
from scratch for every region, it was impractical to use
FC-CNN with MCG proposals). The results are comparable to
those using crisp regions, resulting in 55.7% accuracy on the
OS dataset. Other schemes, such as non-maximum suppression of
overlapping regions, which are quite successful for object
segmentation [?], performed rather poorly in this case. This
is probably because, unlike objects, texture information is
fairly localized and highly irregularly shaped in an image.

When recognizing textures, materials or objects covering the
entire image, the computational advantage of FV-CNN over
FC-CNN is not significant, since it amounts only to evaluating
a few layers fewer. The advantage of FV-CNN becomes clear for
segmentation tasks, as FC-CNN requires recomputing the
features for every region proposal.
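
The source of this computational advantage can be illustrated with a small sketch in which the convolutional layers are evaluated once per image and each region is then described by pooling the local descriptors that fall inside it. Mean pooling is used here as a deliberately simplified stand-in for Fisher vector encoding, region masks are assumed to be given at the resolution of the feature map, and all names are hypothetical.

```python
def pool_region(feature_map, mask):
    """Average the local descriptors that fall inside a region.

    feature_map: 2-D grid (list of lists) of descriptors (lists of floats).
    mask: 2-D grid of 0/1 flags with the same height and width.
    Mean pooling is a simplified stand-in for Fisher vector encoding.
    """
    dim = len(feature_map[0][0])
    total, count = [0.0] * dim, 0
    for feat_row, mask_row in zip(feature_map, mask):
        for descriptor, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for k in range(dim):
                    total[k] += descriptor[k]
    if count == 0:
        raise ValueError("empty region")
    return [t / count for t in total]


def describe_regions(image, masks, conv_layers):
    """Describe every region while evaluating the conv layers only once."""
    feature_map = conv_layers(image)  # shared by all regions
    return [pool_region(feature_map, mask) for mask in masks]


calls = {"conv": 0}


def toy_conv_layers(image):
    """Identity stand-in for the convolutional layers; counts evaluations."""
    calls["conv"] += 1
    return image


toy_image = [[[1.0, 0.0], [3.0, 2.0]],
             [[5.0, 4.0], [7.0, 6.0]]]
masks = [
    [[1, 0], [1, 0]],  # left column
    [[1, 1], [1, 1]],  # whole image
]
descriptors = describe_regions(toy_image, masks, toy_conv_layers)
print(descriptors, calls["conv"])
```

With a region-level descriptor that needs a full network pass per crop, the number of CNN evaluations would instead grow linearly with the number of proposals, which is prohibitive for the thousands of regions produced by MCG.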


8 Applications of describable texture attributes 


This section explores two applications of the DTD attributes:
using them as general-purpose texture descriptors (Sect. 8.1)
and as a tool for search and visualization (Sect. 8.2).


8.1 Describable attributes as generic texture descriptors 

This section explores using the 47 describable attributes of
Sect. ?? as a general-purpose texture descriptor. The first
step in this construction is to learn a multi-class predictor
for the 47 attributes; this predictor is trained on DTD using
a texture representation of choice and a multi-class linear
SVM as before. The second step is to evaluate the multi-class
predictor to obtain a 47-dimensional descriptor (of class
scores) for each image in a target dataset. In this manner,
one obtains a novel and very compact representation which is
then used to learn a multi-class non-linear SVM classifier,
for example for material recognition.
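
The two-step construction can be sketched as follows: the scores of linear attribute classifiers form the compact descriptor, and an RBF kernel is then evaluated between such descriptors. The toy example uses 3 attributes and 2-dimensional features instead of 47 attributes and a high-dimensional texture representation; the weights, the `gamma` value and the function names are made up for illustration.

```python
import math


def attribute_descriptor(features, weights, biases):
    """Scores of linear one-vs-rest attribute classifiers: w_k . x + b_k."""
    return [sum(w * x for w, x in zip(w_k, features)) + b_k
            for w_k, b_k in zip(weights, biases)]


def rbf_kernel(u, v, gamma):
    """exp(-gamma * ||u - v||^2), the kernel used by an RBF SVM."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


# Step 1: hypothetical attribute classifiers (one weight vector per attribute).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, -1.0, 0.5]

# Step 2: describe two target images by their attribute scores.
descriptor_a = attribute_descriptor([2.0, 3.0], weights, biases)
descriptor_b = attribute_descriptor([2.0, 1.0], weights, biases)
print(descriptor_a, descriptor_b)

# The non-linear classifier then operates on these short descriptors.
print(rbf_kernel(descriptor_a, descriptor_a, gamma=0.1))
print(rbf_kernel(descriptor_a, descriptor_b, gamma=0.1))
```

Since the descriptor has only as many dimensions as there are attributes, evaluating a kernel between two images costs a few dozen operations, which is why a non-linear classifier is affordable at this stage.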

Results are reported in Table 8 for material recognition in
FMD and KTH-T2b. There are two important factors in this
experiment. The first one is the choice of the DTD attributes
predictor. Here the best texture representations found before
are evaluated: FV-SIFT, FC-CNN, and FV-CNN (using either VGG-M
or VGG-VD local descriptors), as well as their combinations.
The second one is the choice of classifier used to predict a
texture material based on the 47-dimensional vector of
describable attributes. This is done using either a linear or
RBF SVM.

Using a linear SVM and FV-SIFT to predict the DTD attributes
yields promising results: 64.7% classification accuracy on
KTH-T2b and 49.2% on FMD. The latter outperforms the
specialized aLDA model of [86] combining color, SIFT and
edge-slice features, whose accuracy is 44.6%. Replacing SIFT
with CNN image descriptors (FV-CNN) improves results
significantly for FMD (49.2% vs 62.8% for VGG-M and 70.8% for
VGG-VD) as well as KTH-T2b (64.7% vs 67.4% and 74.6%
respectively). While these results are not as good as using
the best texture representations directly on these datasets,
remarkably the dimensionality of the DTD descriptors is two
orders of magnitude smaller than all the other alternatives.

An advantage of the small dimensionality of the DTD
descriptors is that using an RBF classifier instead of the
linear one is relatively cheap. Doing so improves the
performance by up to about 3% on both FMD and KTH-T2b across
experiments. Overall, the best result of the DTD features on
KTH-T2b is 77.1% accuracy, slightly better than the
state-of-the-art accuracy rate of 76.0% of [?]. On FMD the DTD
features significantly outperform the state of the art [?]:
72.17% accuracy vs. 57.7%, an improvement of about 15%.

Fig. 10: OS material recognition results. Example test image
with material recognition and segmentation on the OS dataset:
(a) original image, (b) ground-truth segmentations from the
OpenSurfaces repository (note that not all pixels are
annotated), (c) FC-CNN and crisp-region proposals segmentation
results, (d) correctly (green) and incorrectly (red) predicted
pixels (restricted to the ones annotated), (e-f) the same, but
for FV-CNN.

Fig. 11: MSRC object segmentation results: (a) image, (b)
ground-truth, (c-d) FC-CNN segmentation and errors, (e-f)
FV-CNN segmentation and errors (in red), (g-h) segmentation
and errors after Dense CRF post-processing.

                       VGG-M                            VGG-VD
dataset  measure (%)   FC-CNN  FV-CNN       FV+FC-CNN   FC-CNN  FV-CNN       FC+FV-CNN   CRF    SoA
OS       pp-acc        36.0    48.6 (46.9)  49.8        38.5    55.5 (55.7)  55.9        56.5   -
OSA      acc-osa       42.8    66.0         63.4        42.1    67.9         64.6        68.9   -
MSRC     acc-msrc      56.1    82.3         75.2        57.7    86.9         81.5        90.2   86.5 [53]

Table 7: Segmentation and recognition using crisp region
proposals of materials (OS) and things & stuff (MSRC).
Per-pixel accuracies are reported, using the MSRC variant (see
text) for the MSRC dataset. Results using MCG proposals [5]
are reported in brackets for FV-CNN.

DTD classifier              KTH-T2b                     FMD
Method                      Linear       RBF            Linear       RBF
FV-SIFT                     64.74±2.36   67.75±2.89     49.24±1.73   52.53±1.26
FV-CNN                      67.39±3.75   67.66±3.30     62.81±1.33   64.69±1.41
FV-CNN-VD                   74.59±2.45   74.71±1.96     70.81±1.39   73.09±1.35
FV-SIFT + FC-CNN            73.98±1.24   74.53±1.14     64.20±1.65   67.13±1.95
FV-SIFT + FC-CNN-VD         74.52±2.31   77.14±1.36     69.21±1.77   72.17±1.66
Previous best [?], [86]     76.0 ±                      57.7±1.7

Table 8: DTD for material recognition. Accuracy on material
recognition on the KTH-T2b and FMD benchmarks obtained by
using as image representation the predictions of the 47 DTD
attributes by different methods: FV-SIFT, FV-CNN (using either
VGG-M or VGG-VD) or combinations. Accuracies are compared to
published state-of-the-art results.

The final experiment compares the DTD attributes with the
semantic attributes of [67] on the Outex data. Using FV-SIFT
and a linear classifier to predict the DTD attributes,
performance on the retrieval experiment of [67] is 49.82% mAP,
which is not competitive with their result of 63.3% obtained
using LBP^u (Sect. 4.1). To verify whether this was due to
LBP^u being particularly optimized for the Outex data, the DTD
attributes were trained again using FV on top of the LBP^u
local image descriptors; by doing so, using the 47 attributes
on Outex results in 64.5% mAP. At the same time, Table ??
shows that LBP^u is not a competitive predictor on DTD itself.
This confirms the advantage of LBP^u on the Outex dataset.


8.2 Search and visualization 

This section includes a short qualitative evaluation of the
DTD attributes. Perhaps their most appealing property is
interpretability. To verify that semantics transfer in a
reasonable way across domains, Fig. 12 shows the DTD
attributes predicted for the ten categories in KTH-T2b; the
semantic correlation between categories and attributes is
excellent. For example, aluminum foil is found to be wrinkled,
while bread is found to be bumpy, pitted, porous and flecked.

As an additional application of our describable texture
attributes we compute them on a large dataset of 10,000
wallpapers and bedding sets from houzz.com. The 47 attribute
classifiers are learned as in Sect. ?? using the FV-SIFT
representation and then applied to the 10,000 images to
predict the strength of association of each attribute and
image. Classifier scores are re-calibrated on the target data
and converted to probabilities by rescaling the scores to have
a maximum value of one on the whole dataset. Fig. 13 shows
some example attribute predictions, selecting for each of a
number of attributes an image that has a score close to 1
(excluding images used for calibrating the scores), and then
including the two additional top attribute matches. The top
two matches tend to be a very good description of each texture
or pattern, while the third is a good match in about half of
the cases.
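
The rescaling and ranking steps described above can be sketched as follows. The sketch assumes non-negative classifier scores, so that dividing by the dataset-wide maximum yields values in [0, 1]; the attribute names are taken from the DTD vocabulary, while the scores, image identifiers and function names are made up.

```python
def rescale_scores(scores):
    """Divide every score by the maximum score over the whole dataset.

    scores: dict mapping image id -> dict mapping attribute -> score.
    Assumption: scores are non-negative and at least one is positive.
    """
    top = max(max(row.values()) for row in scores.values())
    if top <= 0:
        raise ValueError("rescaling assumes a positive maximum score")
    return {image: {attr: s / top for attr, s in row.items()}
            for image, row in scores.items()}


def top_attributes(row, k=3):
    """Names of the k highest-scoring attributes for one image."""
    return sorted(row, key=row.get, reverse=True)[:k]


# Made-up scores for two images and four attributes.
scores = {
    "img1": {"striped": 4.0, "dotted": 1.0, "woven": 2.0, "bumpy": 0.5},
    "img2": {"striped": 0.4, "dotted": 3.0, "woven": 1.0, "bumpy": 2.0},
}
rescaled = rescale_scores(scores)
print(rescaled["img1"]["striped"], rescaled["img2"]["dotted"])
print(top_attributes(rescaled["img1"]))
print(top_attributes(rescaled["img2"]))
```

Because a single global constant is used, the ranking of attributes within each image is unchanged; only the scale of the reported scores is normalized.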


9 Conclusions 

In this paper we have introduced a dataset of 5,640 images
collected "in the wild" that have been jointly labelled with
47 describable texture attributes and have used this dataset
to study the problem of extracting semantic properties of
textures and patterns, addressing real-world human-centric
applications. We have also introduced a novel analysis of
material and texture attribute recognition in a large dataset
of textures in clutter derived from the excellent OpenSurfaces
dataset. Finally, we have analyzed texture representation in
relation to modern deep neural networks. The main finding is
that orderless pooling of convolutional neural network
features is a remarkably good texture descriptor, sufficiently
versatile to double as a scene and object descriptor too, and
yielding new state-of-the-art performance in several
benchmarks.

Fig. 12: Descriptions of materials from the KTH-T2b dataset.
These words are the most frequent top scoring texture
attributes (from the list of 47 we proposed) when classifying
the images from the KTH-T2b dataset. The descriptions are
obtained by considering the whole material category, while a
single image per material is shown for visualization.

Fig. 13: Bedding sets (top) and wallpapers (bottom) with the
top 3 attributes predicted by our classifier and normalized
classification score in brackets.